Jérémie Galarneau [Thu, 13 Feb 2020 04:34:08 +0000 (23:34 -0500)]
sessiond: notification: refactor: split transmission and poll update
Split the notification transmission logic and its effect on a
notification_client from the logic tied to the management of the
notification thread.
This is to make it possible to send (or queue) notifications from the
notification thread or another thread. If another thread encounters an
error or a full socket buffer, a future mechanism will allow it to
signal the notification thread to update its private state (e.g. poll
mask).
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I35d8943cb11473d82b07a4dbc5a0f093cde25a79
Jérémie Galarneau [Tue, 11 Feb 2020 04:06:19 +0000 (23:06 -0500)]
sessiond: notification: synchronize notification client (and list)
Introduce reference counting to notification_client_list and add
locks to both notification_client and notification_client_list.
Of important note, this is in preparation for the introduction of
an action executor thread. The aim of this change is not to make
any part of the notification sub-system thread-safe in any "general"
sense.
The reference counting and locks are introduced to protect a very
specific usage scenario.
The main thread of the notification subsystem and the action executor
will interact through triggers, evaluations, and client lists.
If the action executor needs to send a notification to a list of
clients during the execution of an action group, it obtains the client
list and acquires a reference to it. It then locks the list to iterate
on it, allowing it to send the notification(s) to all subscribed
clients.
Holding the list lock prevents the main thread from disconnecting and
subsequently destroying a client. Holding a reference to the list also
prevents the list from being reclaimed due to a concurrent 'unregister
trigger' operation.
No provisions are made for other access scenarios.
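The usage scenario described above can be sketched as follows. This is
a minimal illustration only: `struct client`, `struct client_list` and
the `client_list_*` helpers are hypothetical stand-ins, not the
sessiond's definitions, and the reference count is protected by the
list lock purely to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the sessiond's structures. */
struct client {
	int id;
	unsigned int notifications_sent;
	struct client *next;
};

struct client_list {
	pthread_mutex_t lock;	/* Protects `head`, the clients and `refcount`. */
	unsigned int refcount;
	struct client *head;
};

static struct client_list *client_list_create(void)
{
	struct client_list *list = calloc(1, sizeof(*list));

	if (!list) {
		return NULL;
	}
	pthread_mutex_init(&list->lock, NULL);
	list->refcount = 1;
	return list;
}

/* Acquire a reference: the list cannot be reclaimed until it is put. */
static void client_list_get(struct client_list *list)
{
	pthread_mutex_lock(&list->lock);
	assert(list->refcount > 0);
	list->refcount++;
	pthread_mutex_unlock(&list->lock);
}

/* Returns true when the last reference was dropped and the list freed. */
static bool client_list_put(struct client_list *list)
{
	bool released;

	pthread_mutex_lock(&list->lock);
	assert(list->refcount > 0);
	released = --list->refcount == 0;
	pthread_mutex_unlock(&list->lock);
	if (released) {
		while (list->head) {
			struct client *next = list->head->next;

			free(list->head);
			list->head = next;
		}
		pthread_mutex_destroy(&list->lock);
		free(list);
	}
	return released;
}

static int client_list_add(struct client_list *list, int id)
{
	struct client *client = calloc(1, sizeof(*client));

	if (!client) {
		return -1;
	}
	client->id = id;
	pthread_mutex_lock(&list->lock);
	client->next = list->head;
	list->head = client;
	pthread_mutex_unlock(&list->lock);
	return 0;
}

/*
 * Executor side: holding the list lock while iterating prevents clients
 * from being removed and destroyed during the traversal.
 */
static unsigned int client_list_notify_all(struct client_list *list)
{
	unsigned int sent = 0;
	struct client *client;

	pthread_mutex_lock(&list->lock);
	for (client = list->head; client; client = client->next) {
		client->notifications_sent++;
		sent++;
	}
	pthread_mutex_unlock(&list->lock);
	return sent;
}
```

In this sketch, the executor would call client_list_get() before
iterating and client_list_put() once done, so a concurrent
'unregister trigger' only drops its own reference.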
Squashed fix, otherwise the tests would hang.
Fix: don't hold client lock while handling subscription changes
Holding the client's lock while handling subscription changes causes a
lock inversion between the client list lock and the client lock.
This happens when a client subscribes to a condition that evaluates to
'true' at the time of the subscription. When this happens, a
notification is sent right away and that communication will attempt to
acquire the client lock.
Holding the client lock for such a long period is not needed anyhow
and we can simply protect the communication state when it is actually
modified/used.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7f38f81fa8bc32e5384538acdffab0824862cff2
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I88a9f66d31e3127b2bcc3f04016c364e0b3fe9ce
Jérémie Galarneau [Mon, 10 Feb 2020 23:35:27 +0000 (18:35 -0500)]
sessiond: notification: introduce the notion of 'active' client
Since notification_clients are now accessed from multiple threads, it
is possible for a thread to access a client while it is being
"cleaned-up" following an error.
The 'active' communication flag allows a check to be performed before
any communication is attempted with a client. The communication is
considered 'active' once the handshake has been performed. It is
considered 'inactive' if a fatal protocol error occurs.
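A minimal sketch of such a check, assuming a hypothetical
`struct client_comm` whose `active` flag is protected by a per-client
lock (all names below are illustrative, not the sessiond's):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum send_status {
	SEND_OK,
	SEND_INACTIVE,
};

/* Hypothetical communication state of a notification client. */
struct client_comm {
	pthread_mutex_t lock;
	bool active;		/* Set once the handshake has completed. */
	unsigned int sent;	/* Number of messages "sent" in this sketch. */
};

static void client_comm_init(struct client_comm *comm)
{
	pthread_mutex_init(&comm->lock, NULL);
	comm->active = false;
	comm->sent = 0;
}

static void client_comm_set_active(struct client_comm *comm, bool active)
{
	pthread_mutex_lock(&comm->lock);
	comm->active = active;
	pthread_mutex_unlock(&comm->lock);
}

/* Any thread checks the flag before attempting to communicate. */
static enum send_status client_comm_send(struct client_comm *comm)
{
	enum send_status status = SEND_OK;

	pthread_mutex_lock(&comm->lock);
	if (!comm->active) {
		status = SEND_INACTIVE;
	} else {
		comm->sent++;
	}
	pthread_mutex_unlock(&comm->lock);
	return status;
}
```

Here client_comm_set_active(comm, true) stands for a completed
handshake and client_comm_set_active(comm, false) for a fatal protocol
error.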
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If16f83ef12fe9a6ef597ae464867011811389177
Jérémie Galarneau [Sat, 8 Feb 2020 04:21:40 +0000 (23:21 -0500)]
sessiond: notification: maintain an id to notification_client ht
In preparation for the addition of an action execution worker, add a
client_id_ht which will allow the action worker to send commands to
the notification thread referencing clients by 'id' rather than by
their socket.
This is done in order to prevent FD re-use races between the various
threads at play when a communication error occurs on a notification
client socket.
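To illustrate why an id is a safer key than the socket, the sketch
below uses a small fixed-size array in place of the actual hash table.
The `client_table_*` names and the array-based lookup are assumptions
made for the example only. A socket fd may be handed out again by the
kernel after a close, whereas an id taken from a monotonic counter is
never reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CLIENTS 8

/* Hypothetical entry: an id of 0 marks a free slot. */
struct client_entry {
	uint64_t id;
	int socket_fd;
};

struct client_table {
	uint64_t next_id;
	struct client_entry entries[MAX_CLIENTS];
};

static void client_table_init(struct client_table *table)
{
	size_t i;

	table->next_id = 1;
	for (i = 0; i < MAX_CLIENTS; i++) {
		table->entries[i].id = 0;
		table->entries[i].socket_fd = -1;
	}
}

/* Returns the new client's id, or 0 if the table is full. */
static uint64_t client_table_add(struct client_table *table, int socket_fd)
{
	size_t i;

	for (i = 0; i < MAX_CLIENTS; i++) {
		if (table->entries[i].id == 0) {
			table->entries[i].id = table->next_id++;
			table->entries[i].socket_fd = socket_fd;
			return table->entries[i].id;
		}
	}
	return 0;
}

static struct client_entry *client_table_find(struct client_table *table,
		uint64_t id)
{
	size_t i;

	if (id == 0) {
		return NULL;
	}
	for (i = 0; i < MAX_CLIENTS; i++) {
		if (table->entries[i].id == id) {
			return &table->entries[i];
		}
	}
	return NULL;
}

static void client_table_remove(struct client_table *table, uint64_t id)
{
	struct client_entry *entry = client_table_find(table, id);

	if (entry) {
		entry->id = 0;
		entry->socket_fd = -1;
	}
}
```

A command that references a stale id simply fails the lookup, even if
the old client's fd number now belongs to a different client.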
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7b8774c23a4a2a7c41ed8c1652f23d29954a0771
Jonathan Rajotte [Fri, 22 May 2020 14:36:46 +0000 (10:36 -0400)]
Fix: tests: `pgrep -f` flags unrelated process as lttng-sessiond
Observed issue
==============
lttng-sessiond is not started by start_lttng_sessiond_opt and a vim
process is killed on stop_lttng_sessiond_opt.
Cause
=====
We use "pgrep -f" with the pattern of interest to gather the pids that
should be the lttng processes we are interested in. `pgrep -f` yields
false positives since it matches against the complete cmdline including
parent directory of the command and all arguments.
For example, the following will currently match for the sessiond
pattern:
vim src/bin/lttng-sessiond/notification-thread-internal.h
This prevents the launch of sessiond by start_lttng_sessiond_opt and ends
up killing the vim process on stop_lttng_sessiond_opt.
Solution
========
To alleviate this, we propose a two stage lookup. The first stage uses
"pgrep -f" yielding potential candidates. The second stage performs
grep on the basename of the first field of the /proc/[pid]/cmdline
for each pid candidate.
The first field of /proc/[pid]/cmdline corresponds to the actual command.
We use the basename to ensure that we do not match on the path to the
executable.
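The test utilities implement this lookup in shell. The C sketch below
only illustrates the second-stage check, under the assumption that the
buffer holds the NUL-separated content of /proc/[pid]/cmdline. The
function names are hypothetical, and a substring search stands in for
grep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the basename of the first field of a NUL-separated cmdline
 * buffer, or NULL if the buffer is empty or its first field is not
 * terminated.
 */
static const char *cmdline_command_basename(const char *cmdline, size_t len)
{
	const char *slash;

	if (len == 0 || !memchr(cmdline, '\0', len)) {
		return NULL;
	}
	/* strrchr() stops at the first NUL, i.e. at the end of field 0. */
	slash = strrchr(cmdline, '/');
	return slash ? slash + 1 : cmdline;
}

/* Second stage: match the pattern against the command's basename only. */
static bool cmdline_command_matches(const char *cmdline, size_t len,
		const char *pattern)
{
	const char *base = cmdline_command_basename(cmdline, len);

	return base && strstr(base, pattern) != NULL;
}
```

With this check, the vim invocation shown earlier is rejected since
the basename of its first field is "vim", while the path to the edited
file only appears in a later field.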
Known drawbacks
===============
None
References
==========
https://review.lttng.org/c/lttng-tools/+/3043
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I479ebad27f4965ae16d4442a6fe58ff3157d76fa
Jérémie Galarneau [Tue, 11 Feb 2020 23:22:13 +0000 (18:22 -0500)]
logging: print human-readable thread names when logging
The lttng_thread interface used by the session daemon uniquely names
all threads. This name can be used to augment the thread's logging
statement with a human-readable name rather than using the pid/tid
tuple used elsewhere.
Additionally, the thread name is set using the pthread API so that it
is visible in GDB and other tools (e.g. htop).
Invocations of pgrep in the test utilities are replaced by 'pgrep -f',
which matches against the full command line rather than the process name.
We are not the first to encounter this problem after renaming the main
thread, see
https://github.com/mongodb/mongo/commit/726cafd713c7333640f8458ec9808ed4f678e3a7#diff-a9003101d1e4a99ac2d43d9b1b839587R122
pgrep uses the name in /proc/$PID/status which contains the
thread name, not the executable name. In the case of the sessiond,
this is now "Main".
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I73dfe8683b2ea31f7ed0c2ffdfa8332f36e28f9b
Jérémie Galarneau [Tue, 11 Aug 2020 19:58:08 +0000 (15:58 -0400)]
sessiond: clarify the role of notification credentials
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ica7a7370fcc4d34c9af11bcb9e7435e19b29a6f8
Jonathan Rajotte [Tue, 24 Mar 2020 18:08:16 +0000 (14:08 -0400)]
Use lttng_trigger credentials to send evaluation to client
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I49b4b2aeda09d09b7d8630562660dac96f36b3e7
Jonathan Rajotte [Fri, 7 Feb 2020 22:40:46 +0000 (17:40 -0500)]
trigger: introduce refcounting
This will be used for listing and more.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic9ee8753257fcbe15fc1321b18582628328c0154
Jonathan Rajotte [Fri, 7 Aug 2020 20:22:36 +0000 (16:22 -0400)]
trigger: use condition and action ref counting to ease internal objects management
Currently a trigger object has multiple relationship types with the action and
condition associated to it.
On the API client side, the trigger does not own the action and condition
associated to it. The user is responsible for managing the lifetime of the
associated action and condition objects.
On the sessiond side, triggers created from `lttng_trigger_create_from_payload`
currently require that before calling lttng_trigger_destroy, the action and
condition be fetched and deleted using their respective destructor. This
operation cannot be done inside `lttng_trigger_destroy` since the exposed API is
clear that the trigger does not own the objects.
We can facilitate the lifetime/ownership management of the action and condition
objects using their respective reference counting mechanism.
On trigger creation, the trigger gets references on both objects. On destroy,
the trigger puts the references on the objects.
From an API client perspective, nothing changes. Even better, it prevents
premature freeing of these objects, since the trigger holds references to these
objects.
On the sessiond side, we can now move the actual ownership of the action and
condition objects to the trigger object. This is done in
`lttng_trigger_create_from_payload` by forcing a put of the local references to
the objects, effectively moving the ownership to the trigger object.
Note that the `lttng_trigger_get_{action, condition}` do not `get` a reference
to the object before returning it. This is done to comply with the API that was
introduced back in 2.11 which does expect a client to call lttng_{action,
condition}_destroy on the returned object.
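The get/put ownership scheme described above can be sketched as
follows. `struct ref_object`, `struct trigger_sketch` and their helpers
are hypothetical simplifications used for illustration; they are not
the lttng_trigger, lttng_condition or lttng_action implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference-counted object (a condition or an action). */
struct ref_object {
	unsigned int refcount;
};

static struct ref_object *object_create(void)
{
	struct ref_object *object = calloc(1, sizeof(*object));

	if (object) {
		object->refcount = 1;
	}
	return object;
}

static void object_get(struct ref_object *object)
{
	assert(object->refcount > 0);
	object->refcount++;
}

/* Returns true when the last reference was dropped and the object freed. */
static bool object_put(struct ref_object *object)
{
	assert(object->refcount > 0);
	if (--object->refcount == 0) {
		free(object);
		return true;
	}
	return false;
}

struct trigger_sketch {
	struct ref_object *condition;
	struct ref_object *action;
};

/* On creation, the trigger gets a reference on both objects. */
static struct trigger_sketch *trigger_sketch_create(
		struct ref_object *condition, struct ref_object *action)
{
	struct trigger_sketch *trigger = calloc(1, sizeof(*trigger));

	if (!trigger) {
		return NULL;
	}
	object_get(condition);
	object_get(action);
	trigger->condition = condition;
	trigger->action = action;
	return trigger;
}

/* The getter does not take a reference on the returned object. */
static struct ref_object *trigger_sketch_get_condition(
		const struct trigger_sketch *trigger)
{
	return trigger->condition;
}

/* On destroy, the trigger puts its references on both objects. */
static void trigger_sketch_destroy(struct trigger_sketch *trigger)
{
	object_put(trigger->condition);
	object_put(trigger->action);
	free(trigger);
}
```

In this sketch, a client that drops its own references before
destroying the trigger does not free the objects prematurely: the
trigger's references keep them alive until trigger_sketch_destroy().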
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I31d196d00c886fcf42c1b16c55e693949373568b
Jonathan Rajotte [Fri, 7 Aug 2020 19:39:24 +0000 (15:39 -0400)]
condition: introduce reference counting
This will allow easier management of the trigger ownership of its associated
condition and action objects.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I46420b03bd4bf7948bc2d1f44985edfe86c27c61
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: tests: fd-tracker: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I24fde39587495f5f48f900898755e57c10fbf303
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: relayd index: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6a79fcdceebc0959bb66525daac288953e863bb3
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond comm relay: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2c42c7b7574ccec85524a1b0aed33d108725ad4a
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: compat time: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If28840c388baffd257e61ef1ecad7c1ea3e3fd68
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: kernel consumer: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I55873fe5500e00d725ebaa241f4054a793b15aee
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond ust-app: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ba505a8206f4ee8838fc3268e3904c496eb3802
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond notification thread: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ieabe7e8a2066b30ef8b413896d03f8ff824ddc99
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond kernel: change spaces to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3b8f476db2f9038fddf3b648dba0fa06f20db5d8
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond kernel: fix include style
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8959283af0cafa94b0939e0025b08b40ff86653e
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond consumer: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I367ffe89e3930b2e84ad8c70b95bd95a968d974c
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Idb749331397c2b43ea41a33f09bf35d15c4b8f8a
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond manage-consumer: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic897e04aa96bd58c6e8dd340f8537af86c2dd478
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: relayd: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4aa7d69ef823e4ea85820cadd23463869f98edf2
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond command: fix include style
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2f3d695a97d31ca5b098872abc566a8b261cdf9c
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond command: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib809c788f3bf32360cd68b16972636db3ef4dfd7
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: relayd trace-chunk: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I789edf8a6b9f893c6e3b5f72fb9001f8e68eebd7
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: relayd index: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9fe8b9fa536a89fd04e989cf8b3f5957de88df1e
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: relayd session: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I545728dc95f67113c0f7adf0cbc9fc742e4018c7
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: liblttng-ctl clear: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I423545aa9de0521b861926b29e21b187500e5fd0
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: liblttng-ctl destruction handle: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie5d8f2fc1e2ad358b287550065f24f8928c8304c
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: mi-lttng: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia6ebea7f227d05cc9ebb8cf84717cfaf59f3e222
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: time: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I922111fab4667a5af35ed45d9bbdee0bcb477b05
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: consumer-stream: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Idfe9f1202762bd9a026659d61e50112c81641c88
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: consumer metadata cache: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9abf075960a33bac01379fdc6fc79dca83f84ee2
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: run-as: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Icbf393b779cf66a93b49ecef4279a6db820b4968
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: utils: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I67341c2a849c340fc56cc03dbca9daaf7c9eddb1
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond rotation: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8a2192da7b69a5a1945e4605167e7d0c91365681
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: sessiond: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I09854bc28773944ad2b70cb7645dcd53e0ee35ab
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: trace-chunk: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0099bf493c8b80aadf7ac8c7b5668c0bb99d27dd
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: consumer: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I688ab6903064cee0ef43b4ee7d395e6a60eb729b
Jérémie Galarneau [Tue, 11 Aug 2020 16:22:19 +0000 (12:22 -0400)]
Clean-up: optional: change space to tabs
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6e08f4cc6198583f3dfd9dd3b02da88244e2aa9d
Jonathan Rajotte [Tue, 24 Mar 2020 16:15:22 +0000 (12:15 -0400)]
trigger: internal: add credentials information
These credentials are for internal use only. We plan on using them for
listing, run-as filter bytecode generation and more.
Introduce LTTNG_OPTIONAL_GET_PTR which is used to assert and return the
pointer for an LTTNG_OPTIONAL underlying element.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I70d7f79918693b52db9b384b56e6476f3449e71d
Jonathan Rajotte [Tue, 24 Mar 2020 16:11:41 +0000 (12:11 -0400)]
credentials: implement is_equal
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I526fd15b8365fd2a77b5aa5250a75d956fd8efc3
Jérémie Galarneau [Fri, 7 Feb 2020 23:25:36 +0000 (18:25 -0500)]
sessiond: clean-up: remove unused `state` function parameter
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id606bc0c0db713902cbab2fafed5c81a05ef5f08
Jérémie Galarneau [Mon, 10 Aug 2020 21:41:28 +0000 (17:41 -0400)]
.gitignore: add test_unix_socket
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic7df31b45ce2ad9806790c1041506ce96e1a9e8d
Jonathan Rajotte [Tue, 14 Jul 2020 17:00:06 +0000 (13:00 -0400)]
sessiond: comm: add fd count information for lttng-ctl -> sessiond comm
This allows bulk reception of fds incoming from lttng-ctl before doing
any object deserialization.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I16f054be1d8b8d1e7868784c9b4088750b9a3651
Jérémie Galarneau [Fri, 7 Aug 2020 02:05:25 +0000 (22:05 -0400)]
Tests: add unix socket wrapper unit tests
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0db99d77b47326a5e97c39679a053ba4be7f717c
Jérémie Galarneau [Thu, 6 Aug 2020 19:19:54 +0000 (15:19 -0400)]
payload-view: mark lttng_payload_view_get_fd_handle_count as const
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0f0b1620e04fa5a5ae06b096dc940504a7d7553e
Jérémie Galarneau [Wed, 5 Aug 2020 16:09:36 +0000 (12:09 -0400)]
Fix: uprobe: inequality comparison against NULL
The expression will not have its intended effect.
In lttng_userspace_probe_location_function_serialize: Comparing a
pointer against NULL using an operator such as < or >=.
`binary_fd` is now a fd_handle instance rather than a "raw" fd. All
instances of `binary_fd` are renamed to `binary_fd_handle` to prevent
such errors in the future.
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I57f1a3d5a01029084889a183881fac6f1fe9b6d9
Jonathan Rajotte [Tue, 7 Jul 2020 23:50:06 +0000 (19:50 -0400)]
Fix: action: lttng_action_group_add_action returns status invalid
IS_GROUP_ACTION macro was performed on the action to be added instead
of the action representing the group, `group` in this function.
This always resulted in error on call to lttng_action_group_add_action.
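A reduced illustration of the corrected check follows. The types, the
status values and the definition of the IS_GROUP_ACTION macro below are
assumptions made for this sketch; only the idea of validating `group`
rather than the action being added comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUP_ACTIONS 4

enum action_type {
	ACTION_TYPE_NOTIFY,
	ACTION_TYPE_GROUP,
};

enum action_status {
	ACTION_STATUS_OK,
	ACTION_STATUS_INVALID,
};

/* Hypothetical action object; groups hold a small array of children. */
struct action {
	enum action_type type;
	struct action *children[MAX_GROUP_ACTIONS];
	size_t count;
};

/* Assumed definition, for illustration only. */
#define IS_GROUP_ACTION(a) ((a)->type == ACTION_TYPE_GROUP)

static enum action_status group_add_action(struct action *group,
		struct action *action)
{
	/* The type check applies to `group`, not to the added `action`. */
	if (!group || !action || !IS_GROUP_ACTION(group)) {
		return ACTION_STATUS_INVALID;
	}
	if (group->count == MAX_GROUP_ACTIONS) {
		return ACTION_STATUS_INVALID;
	}
	group->children[group->count++] = action;
	return ACTION_STATUS_OK;
}
```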
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iecf133fb2165686d1e8abd2225a8f90df22c41dc
Jérémie Galarneau [Tue, 23 Jun 2020 22:06:53 +0000 (18:06 -0400)]
Fix: evaluation: dereference before NULL check in create_from_payload
An evaluation payload view is created from the view passed to
lttng_evaluation_create_from_payload. Since a view contains a const
copy of the _fds array, it must be initialized at the declaration site.
However, src_view is checked for NULL after the initialization. Coverity
rightfully warns that:
1429799 Dereference before null check
There may be a null pointer dereference, or else the comparison against
null is unnecessary.
In lttng_evaluation_create_from_payload: All paths that lead to this
null pointer comparison already dereference the pointer
earlier (CWE-476)
This is not reachable right now, but it is fixed to silence the warning
and prevent future mistakes.
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3ceace4117ff54265b1f0cf6a4c638aec95f2879
Jérémie Galarneau [Thu, 30 Jul 2020 03:13:16 +0000 (23:13 -0400)]
Tests: rework tracefile_count test to meet the tracefile count limit
The original test simply fires a test application and checks that file
count limits are not exceeded.
The test is reworked to produce events until the total stream size
stabilizes. Then the number of files is checked against the expected
limit.
Some shellcheck warnings are fixed as part of this change.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id9ca3e16c05898c4009b4fb8fcf7882fdcb9dcf1
Jérémie Galarneau [Fri, 31 Jul 2020 15:20:40 +0000 (11:20 -0400)]
Fix: extraneous empty/inactive flush on rotation out of a trace chunk
Observed issue
==============
A test (tests/regression/tools/tracefile-limits/test_tracefile_count)
occasionally fails on ppc64. The trace validation step fails in
the case where the trace file count limit is set to 1.
Examining the resulting trace shows that the last packet of data
produced by the test application appears to be missing.
The test case enables a channel in "overwrite" mode. Normally, this
would guarantee that the last data produced will always be available in
the resulting trace.
Cause
=====
An empty/inactive flush is performed when rotating "out" of a non-null
trace chunk to ensure that the trace chunk contained at least one
packet.
Looking at the test's resulting trace and by following the consumerd
logs, we see that the test application runs on one CPU for most of its
lifetime. The stream file is repeatedly replaced to make room for the
latest data.
Eventually, the application is migrated to another CPU. A number of
packets are written to this new stream. The session is then stopped
which causes an active flush to occur to close the current packet of
all streams (see `ust_app_stop_trace_all`).
Then, when the session is destroyed, an empty/inactive flush is
performed to ensure that at least one packet was produced in the current
trace chunk [1].
At the moment of writing this empty packet, the consumer daemon sees
that there is not enough space left in the stream file to honour the
trace file size restriction. It thus overwrites the file resulting in
the loss of the last events to replace them with the empty "end of
chunk" packet that occupies a single page.
While the problem is not specific to PowerPC 64, it has a lot more
chances to occur there as pages are typically configured to be of 64kb
length. Due to current implementation limitations, empty packets have
a size of one page.
In other words, 4kb pages typically fit in the space left in the file,
causing the problem to not be easily reproducible on x64.
Note that while the file size limit is specified as "3 * PAGE_SIZE" in
the test, it is rounded-up to 512kb to accommodate at least one
sub-buffer.
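The effect of the page size can be shown with a small worked example.
The helper name and the 480 KiB of data already present in the file are
hypothetical values chosen for illustration; the 512 KiB limit and the
one-page empty packet come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KIB(n) ((uint64_t) (n) * 1024)

/*
 * A packet that does not fit in the space left under the file size
 * limit forces a switch to the next file. With a file count limit of
 * 1, this means overwriting the only file.
 */
static bool packet_forces_file_switch(uint64_t current_file_size,
		uint64_t packet_size, uint64_t max_file_size)
{
	return current_file_size + packet_size > max_file_size;
}
```

With 480 KiB already written and a 512 KiB limit, a 64 KiB empty packet
does not fit (480 + 64 > 512) while a 4 KiB empty packet does
(480 + 4 <= 512), which is consistent with the issue being hard to
reproduce on systems with 4 KiB pages.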
Solution
========
[1] This empty/inactive flush is no longer necessary since f96af312b
as an "open packet" (which performs an empty/inactive flush) is
performed when a stream enters a non-null trace chunk. There is no
concern that a trace chunk will be left empty unless this initial flush
fails (see patch comments and work-around).
The empty flush that was performed for data streams is converted
into an active flush under most circumstances; the packet is simply
closed.
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5602b7ab8318374f75060489cf9c27af4e058805
Jérémie Galarneau [Thu, 30 Jul 2020 16:14:21 +0000 (12:14 -0400)]
Fix: relayd: double unlock on viewer stream creation error
viewer_stream_create must be called with the relay stream's lock held
since 9edaf114d. A call to pthread_mutex_unlock was left behind in the
error path of viewer_stream_create, resulting in a double-unlock in
some error scenarios.
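The locking contract can be sketched as follows: a function that must
be called with a lock held never releases it, including on its error
path, so the caller performs the single unlock. The structure and
function names are hypothetical, and the `held` counter only exists to
make the contract checkable in this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a relay stream. */
struct stream_sketch {
	pthread_mutex_t lock;
	int held;		/* 1 while the lock is held, 0 otherwise. */
	bool fail_creation;	/* Forces the error path in this sketch. */
};

static void stream_sketch_lock(struct stream_sketch *stream)
{
	pthread_mutex_lock(&stream->lock);
	stream->held++;
}

static void stream_sketch_unlock(struct stream_sketch *stream)
{
	/* A double-unlock would trip this assertion. */
	assert(stream->held == 1);
	stream->held--;
	pthread_mutex_unlock(&stream->lock);
}

/* Must be called with the stream's lock held; never releases it. */
static int create_viewer_stream_locked(struct stream_sketch *stream)
{
	assert(stream->held == 1);
	if (stream->fail_creation) {
		/* Error path: the lock remains owned by the caller. */
		return -1;
	}
	return 0;
}

static int open_viewer_stream(struct stream_sketch *stream)
{
	int ret;

	stream_sketch_lock(stream);
	ret = create_viewer_stream_locked(stream);
	/* The only unlock, on both the success and error paths. */
	stream_sketch_unlock(stream);
	return ret;
}
```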
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I613f67440669af52f625b7bdc4e9ae6ce75956ea
Jérémie Galarneau [Tue, 28 Jul 2020 15:29:18 +0000 (11:29 -0400)]
Fix: relayd: live connection fails to open file during clear
Issue observed
==============
A `session clear` test occasionally fails on the CI (very rarely, and
more often on PowerPC executors for some reason) with babeltrace
reporting that the live connection was closed by the remote side:
PASS: tools/clear/test_ust 276 - Waiting for live viewers on url: net://localhost
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1701 [lttng-live] Error receiving get new streams reply
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
07-21 16:17:07.058 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x1014cca8, muxer-comp-addr=0x1015afd8, muxer-msg-iter-addr=0x1014d668, msg-iter-addr=0x1014d598, status=ERROR
07-21 16:17:07.059 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d598, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
07-21 16:17:07.059 23855 23855 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8, comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
07-21 16:17:07.059 23855 23855 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully
ERROR: [Babeltrace CLI] (babeltrace2.c:2548)
Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128,
comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
(`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8,
comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x1014d598, iter-upstream-comp-name="muxer",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
iter-upstream-comp-class-name="muxer",
iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
Cannot validate muxer's upstream message iterator wrapper:
muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
iter-upstream-comp-class-name="lttng-live",
iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
Error preparing the next batch of messages:
live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1701)
Error receiving get new streams reply
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198)
Remote side has closed connection
ok 277 - Clear session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 277 - Clear session J7WXjh7fmMleTCfE
ok 278 - Clear session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 278 - Clear session J7WXjh7fmMleTCfE
ok 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE
ok 280 - Destroy session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 280 - Destroy session J7WXjh7fmMleTCfE
# Wait for viewer to exit
not ok 281 - Babeltrace succeeds
Cause
=====
Looking at the relay daemon logs, it appears that the live client
requests an enumeration of the available streams while a rotation is
ongoing (clear).
Ultimately, this results in the relay daemon attempting to open a
non-existing file:
PERROR - 16:33:59.242388809 [734380/734387]: Failed to open fs handle to ust/uid/1000/64-bit/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
The logs indicate that this file existed at some point. However, it
was unlinked and its newest instance was created in a trace chunk
named "20200727T163359-0400-1".
This chunk name is a temporary name used until the original trace
chunk can be unlinked (cleared) and the newest can be moved in its
place.
The file is being opened as part of the creation of a viewer stream
when make_viewer_streams() fails to find it. This implies that, somehow,
an outdated trace chunk is being used to open the viewer stream's file.
The reason is that make_viewer_streams() is called with the
viewer session's current trace chunk. During a rotation, the use of the
viewer session's current trace chunk is touchy due to the way the
switch-over to a new chunk is handled.
How viewer session/stream trace chunks are rotated
--------------------------------------------------
The viewer polls the relay daemon for new data/metadata to consume using
the `GET_NEXT_INDEX` and `GET_METADATA` commands. Both commands will
indicate to the viewer that it should try again later if a rotation is
ongoing on the "side" of the relay session.
When a rotation is not ongoing, the relay compares the `id` of the
target viewer stream's trace chunk with the relay session's current
trace chunk. If those `id`s don't match, the viewer session's current
trace chunk is then updated to a copy of the relay session's trace
chunk. The viewer stream's files are then closed and re-opened in the
context of the viewer session's now-updated trace chunk.
While the live protocol allows `GET_NEXT_INDEX` and `GET_METADATA` to
communicate to the viewer that it should "retry" later, there are no such
provisions made for the paths that lead to the creation of the viewer
streams. This means viewer streams must be created even if a rotation is
ongoing.
Solution
========
If a rotation is ongoing at the moment of the creation of the viewer
streams, we wish to use a copy of the trace chunk being used by the
relay stream. This way, we can be sure that the streams' files exist.
It is okay for viewer streams to hold references to different copies of
the trace chunks since no user-visible actions are performed when the
reference to those chunks is released. This is different from relay
streams where the detection of the completion of a rotation is done when
the last relay stream releases its reference to a specific trace chunk
instance.
Known drawbacks
===============
None beyond a slight increase (temporary until the next rotation) in the
number of FDs used when a client connects during a session rotation.
Note
====
Since make_viewer_streams now acts on relay and viewer counterparts of
the session and stream objects, the various variables are prefixed with
`relay` and `viewer` to make the code easier to understand.
The locking period of the relay stream is extended to most of the
iteration in make_viewer_streams() rather than only in
viewer_stream_create(). As a result, callers of viewer_stream_create()
must hold the relay stream's lock.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3b9c52a57835134712eadba0f68f6a21e922356b
Jonathan Rajotte [Tue, 28 Jul 2020 14:26:02 +0000 (10:26 -0400)]
Fix: sessiond: unchecked return value
From Coverity:
CID 1431048 (#1 of 1): Unchecked return value (CHECKED_RETURN)
1. check_return: Calling lttng_dynamic_buffer_set_size without
checking return value (as is done elsewhere 32 out of 40 times).
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id15632e3932052cf7dce5d57a08ac6efc84fc92f
Jonathan Rajotte [Tue, 28 Jul 2020 14:22:28 +0000 (10:22 -0400)]
Fix: common: unchecked return value
From Coverity:
CID 1431050 (#1 of 1): Unchecked return value (CHECKED_RETURN)
1. check_return: Calling lttng_dynamic_buffer_set_size without
checking return value (as is done elsewhere 32 out of 40 times).
Solution
========
Since we are inside the clear operation for the payload, ignore the
return value.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I45c1054de7252aab4b77f330102ff4f489f20a6d
Jonathan Rajotte [Tue, 28 Jul 2020 14:19:30 +0000 (10:19 -0400)]
Fix: common: improper use of negative return
From Coverity:
CID 1431053 (#1 of 2): Improper use of negative value (NEGATIVE_RETURNS)
5. negative_returns: fd_count is passed to a parameter that cannot be
negative.
Solution
========
Check return value for fd_count and goto error if negative.
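The check described above can be sketched as follows. This is only an
illustration of the pattern, not the patched code: `sketch_count_fds()`,
`sketch_consume_fds()` and `sketch_process_fds()` are hypothetical names
introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical producer: returns a count, or a negative value on error. */
int sketch_count_fds(const int *fds, int capacity)
{
	if (!fds || capacity < 0) {
		return -1;
	}
	return capacity;
}

/* Hypothetical consumer: its `count` parameter cannot be negative. */
size_t sketch_consume_fds(const int *fds, size_t count)
{
	size_t i, valid = 0;

	for (i = 0; i < count; i++) {
		if (fds[i] >= 0) {
			valid++;
		}
	}
	return valid;
}

int sketch_process_fds(const int *fds, int capacity, size_t *processed)
{
	int ret;
	const int fd_count = sketch_count_fds(fds, capacity);

	if (fd_count < 0) {
		/* Bail out before the negative value is converted to size_t. */
		ret = -1;
		goto error;
	}

	*processed = sketch_consume_fds(fds, (size_t) fd_count);
	ret = 0;
error:
	return ret;
}
```

Without the check, a negative `fd_count` converted to `size_t` would
become a very large count.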
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibdb2f065f0ebe51efae9125630a17730386395ac
Jonathan Rajotte [Tue, 28 Jul 2020 14:15:04 +0000 (10:15 -0400)]
Fix: sessiond: unchecked return value
From Coverity:
CID 1431054 (#1 of 1): Unchecked return value (CHECKED_RETURN)
1. check_return: Calling lttng_dynamic_buffer_set_size without
checking return value (as is done elsewhere 32 out of 40 times).
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6b5b18d90f194c8081b91b3fbb80dccd29ba19ca
Jérémie Galarneau [Mon, 27 Jul 2020 22:14:30 +0000 (18:14 -0400)]
Fix: relayd: wrong stream type used in DBG statement
get_next_index() is never invoked for metadata streams; the
debug statement was erroneously copy-pasted from the get_metadata()
function.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia57dac7f102eddc78e24781905b390b4fa54bd80
Jérémie Galarneau [Fri, 24 Jul 2020 17:48:00 +0000 (13:48 -0400)]
sessiond: client: use common payload send fds util
Use lttcomm_send_payload_view_fds_unix_sock instead of an open-coded
version of the same functionality.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I36627423287a059e8b2ab1cc5cb42ca0db53e144
Jérémie Galarneau [Wed, 22 Jul 2020 20:15:44 +0000 (16:15 -0400)]
payload: use fd_handle instead of raw file descriptors
Using raw file descriptors with lttng_payloads introduces scary file
descriptor corner-cases when mixing asynchronous communication and
lttng_payloads.
Since an lttng_payload doesn't own its file descriptors, it is easy to
fall into a situation where a file descriptor is referenced by an
lttng_payload while the owner is destroyed.
For instance, a userspace probe could be destroyed while its description
is waiting to be sent to a client.
The various use sites of the payload/payload_view APIs are adjusted.
Utilities to send/recv fds through unix sockets using the payload and
payload view interfaces are added as part of this commit as they use
the payload's fd_handles.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I44073f1b683af6c475e8e93c25a76910acf955e4
Jérémie Galarneau [Wed, 22 Jul 2020 19:29:17 +0000 (15:29 -0400)]
Add fd_handle interface
An fd_handle allows the reference counting of file descriptors which may
be shared by multiple objects. There is no synchronization imposed (or
provided) for the use of the underlying file descriptors as this utility
is meant to be used on file descriptors where this would not make
sense (eventfd, dir fd, etc.)
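A minimal sketch of the idea, assuming a single-threaded user. The
`sketch_fd_handle_*` names are hypothetical and are not the interface
added by this commit; the counter is a plain integer, so this sketch
provides no synchronization of its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical, simplified handle: owns one fd shared by several users. */
struct sketch_fd_handle {
	int fd;
	unsigned int refcount;
};

/* Takes ownership of `fd`; the creator holds the first reference. */
struct sketch_fd_handle *sketch_fd_handle_create(int fd)
{
	struct sketch_fd_handle *handle;

	if (fd < 0) {
		return NULL;
	}

	handle = calloc(1, sizeof(*handle));
	if (!handle) {
		return NULL;
	}

	handle->fd = fd;
	handle->refcount = 1;
	return handle;
}

void sketch_fd_handle_get(struct sketch_fd_handle *handle)
{
	assert(handle && handle->refcount > 0);
	handle->refcount++;
}

/* Returns 1 if the last reference was dropped and the fd was closed. */
int sketch_fd_handle_put(struct sketch_fd_handle *handle)
{
	assert(handle && handle->refcount > 0);
	if (--handle->refcount > 0) {
		return 0;
	}

	(void) close(handle->fd);
	free(handle);
	return 1;
}
```

Each object sharing the descriptor takes its own reference; the
descriptor is closed only when the last one is released.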
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1bdeb48caddda125aa8f41afd502df2eb4a0b8b2
Jérémie Galarneau [Wed, 22 Jul 2020 20:43:26 +0000 (16:43 -0400)]
payload: don't re-initialize destination payload on copy
This makes it possible to re-use the same destination payload
and not always re-allocate memory for its underlying arrays.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I959b2e8851a81034b681d7188c504752620e1b50
Jérémie Galarneau [Fri, 19 Jun 2020 22:44:56 +0000 (18:44 -0400)]
uprobe: transmit binary file descriptor through lttng_payload
Adapt the userspace probe objects to use the lttng_payload interface.
This streamlines the acquisition of the file descriptors when those
objects are serialized.
File descriptors are transmitted in both directions between liblttng-ctl
and the session daemon making it possible (and safe) to compare
userspace probe instances.
Currently the event listing API does not allow us to express userspace
probe locations that contain a file descriptor. This is an unfortunate
consequence of returning a "flat" array to list events.
Indeed, we can't store a file descriptor in the userspace probe
locations returned to the user in this API since the destructors of the
probe locations are never called. The user simply free()'s the returned
array, which would leak the file descriptors.
The consequence of this is that we can't allow the creation of event
rules using a probe location returned by an lttng_list_events() call.
This is not unsolvable, but I'm not sure if there really is a use-case
for this.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0f710cbe9deabfd163206fd03618eba183f3d1d2
Jérémie Galarneau [Tue, 23 Jun 2020 04:23:01 +0000 (00:23 -0400)]
Fix: payload view: payload view always refers to parent's position
A payload view's fd iterator must point to the root view's fd iterator
and not necessarily its parent's. This would cause the iterator
to be reset when views were nested more than two levels deep.
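The intent can be sketched as follows, using a hypothetical
`sketch_view` structure rather than the real payload view layout: every
nested view inherits a pointer to the root's fd position instead of
pointing at a field of its direct parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view: only the fd iterator state is modeled. */
struct sketch_view {
	size_t fd_position;    /* Only meaningful on the root view. */
	size_t *p_fd_position; /* Always designates the root's position. */
};

void sketch_view_init_root(struct sketch_view *root)
{
	root->fd_position = 0;
	root->p_fd_position = &root->fd_position;
}

void sketch_view_init_child(struct sketch_view *child,
		const struct sketch_view *parent)
{
	child->fd_position = 0;
	/*
	 * Inherit the parent's pointer (which already designates the root)
	 * rather than taking the address of the parent's own field.
	 */
	child->p_fd_position = parent->p_fd_position;
}

/* Returns the index of the next fd and advances the shared position. */
size_t sketch_view_pop_fd_index(struct sketch_view *view)
{
	return (*view->p_fd_position)++;
}
```

With this arrangement, popping from a view nested at any depth advances
the same position as popping from the root.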
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I37eab2ca1437b1a8446449984619106ffff90d3a
Jérémie Galarneau [Tue, 23 Jun 2020 03:35:02 +0000 (23:35 -0400)]
Fix: lttng: leak of userspace probe path on listing
realpath() returns a malloc'ed string when `resolved_path` is NULL.
`binary_path` can be safely free'd after it is printed.
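For reference, a small sketch of the ownership rule using POSIX
realpath(); `sketch_print_resolved_path()` is a hypothetical helper, not
the function changed by this commit.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Prints the canonical form of `path`; returns 0 on success, -1 on error. */
int sketch_print_resolved_path(const char *path)
{
	/* With a NULL second argument, realpath() allocates the result. */
	char *resolved = realpath(path, NULL);

	if (!resolved) {
		return -1;
	}

	printf("%s\n", resolved);
	/* The caller owns the returned string and must release it. */
	free(resolved);
	return 0;
}
```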
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic35862f3275f58274ae2572e1d6be736f8a2107e
Jérémie Galarneau [Tue, 23 Jun 2020 03:10:18 +0000 (23:10 -0400)]
payload: incomplete sentence in lttng_payload_init comment
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic23e7264daa2b55028bfe6298d5c41588bf73919
Jérémie Galarneau [Fri, 19 Jun 2020 22:54:58 +0000 (18:54 -0400)]
common: add lttng_payload_view fd count accessor and buffer init
Allow the initialization of a payload view from a subset of a dynamic
buffer (echoing the lttng_buffer_view API) and add an accessor for the
fd count property.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I647e955b625230010a8789df88f93cf19487ce58
Jérémie Galarneau [Fri, 19 Jun 2020 22:43:47 +0000 (18:43 -0400)]
common: move lttng_payload[_view] to libcommon
The payload utils are moved from libsessiond-comm to libcommon since
they present a circular dependency: a payload uses the dynamic buffer
utilities while the actions, triggers, and conditions make use of the
lttng_payload utilities.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib448bde0e442e7660bb7a61cd629f29442715b21
Jérémie Galarneau [Fri, 19 Jun 2020 22:20:08 +0000 (18:20 -0400)]
common: add lttng_dynamic_array_set_count()
Add an lttng_dynamic_array_set_count() helper to resize a dynamic
array.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic38efdb9c2ee6a4176fcde93e4b0d45052d14799
Jérémie Galarneau [Tue, 16 Jun 2020 04:14:13 +0000 (00:14 -0400)]
sessiond: prepare client replies through an lttng_payload
Modify the command context structure to contain an lttng_payload. This
allows commands to return a payload which contains a file descriptor
without accessing the socket directly.
An interesting side-benefit is that, in practice, this eliminates all
dynamic allocations from the client communications beyond the first
command served. The command context is re-used and the reply buffer is
allocated once and not released (only its size is reset to 0).
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id37b300648db8a9ebec7522e24c70e377c29cddc
Jérémie Galarneau [Tue, 16 Jun 2020 03:48:37 +0000 (23:48 -0400)]
Clean-up: sessiond: change spaces to tab
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iac3eae770fa258f0ff1ea7dddbdc93ac32f4a2dc
Jérémie Galarneau [Mon, 15 Jun 2020 22:44:23 +0000 (18:44 -0400)]
Fix: uprobe: missing error code on allocation failure
lttng_userspace_probe_location_function_create_from_payload
does not report an allocation failure when strdup fails.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibed9906cabeca81081647d220a7aef1f54d251d4
Jérémie Galarneau [Mon, 15 Jun 2020 22:43:02 +0000 (18:43 -0400)]
Fix: sessiond: don't negate error code on list error
Listing errors are already negative. Negating them in the error path
causes error codes to be interpreted as a number of events, which
triggers a communication error further on.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6df6fa14e4e443798dfdc1aa7a8fd7cfe380835b
Jonathan Rajotte [Fri, 31 Jan 2020 21:05:16 +0000 (16:05 -0500)]
userspace-probe: implement is_equal
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I11ba6c607f5b4b4556dd9048a140a7b0cdda7e4f
Jonathan Rajotte [Thu, 9 Jul 2020 20:22:19 +0000 (16:22 -0400)]
Fix: send/received actual size is overwritten by 'expected' size
Observed issue
==============
None.
Cause
=====
None.
Solution
========
Return the actual send/received size. Caller code correctly handles
partial send/receive.
Known drawbacks
===============
None
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie52c39d0ab778108e3fde1b2375c0535caec97b1
Jonathan Rajotte [Fri, 10 Jul 2020 12:46:04 +0000 (08:46 -0400)]
unix: add non block send and receive flavors for fd passing
These will be used by the notification subsystem.
It is important to note that, based on our current knowledge, the
sending/receiving of fds is an all-or-nothing scenario. The fds are
actually part of the control message instead of the `payload`. On the
receiving side, a reception of N fds will only yield a "read" count of
1 byte.
Although we don't have to account for partial send/receive, we have to
manage the EAGAIN/EWOULDBLOCK scenario of non-blocking sockets.
The caller of these functions must handle the following scenarios:
ret < 0 -> error
ret == 0 -> nothing received/sent
ret > 0 -> success
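The return convention can be illustrated with a plain-data receive. This
is a sketch under assumptions: `sketch_recv_non_block()` is a
hypothetical helper that uses recv() with MSG_DONTWAIT and does not pass
fds, unlike the functions added here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper following the convention above:
 * < 0 on error, 0 when nothing could be received, > 0 on success.
 * Note: on a stream socket, recv() also returns 0 when the peer has
 * closed the connection; this sketch does not distinguish that case.
 */
ssize_t sketch_recv_non_block(int sock, void *buf, size_t len)
{
	ssize_t ret;

	do {
		ret = recv(sock, buf, len, MSG_DONTWAIT);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
		/* Nothing available yet: report 0 rather than an error. */
		ret = 0;
	}

	return ret;
}
```

A caller would typically keep the socket in its poll set when 0 is
returned and try again once the socket is reported readable.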
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iabf8de876991c9f1c5b46ea609f9af961c4a7ab9
Jonathan Rajotte [Thu, 9 Jul 2020 18:57:57 +0000 (14:57 -0400)]
Fix: partial recv lead to client disconnect
On EWOULDBLOCK and EAGAIN, the ret value from
`lttcomm_send_unix_sock_non_block` and
`lttcomm_recv_unix_sock_non_block` would be equal to -1.
Solution
========
For such cases, set ret equal to zero since it is effectively what was
received/sent. Caller should treat ret >= 0 as `success` anyway.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I47677f26850bf544f672d270972de7cc2c6a2817
Michael Jeanson [Mon, 13 Jul 2020 19:41:01 +0000 (15:41 -0400)]
tests: return the proper TAP exit code
The C TAP library provides the 'exit_status()' function that will return
the proper exit code according to the number of tests that succeeded or
failed.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0de2349609eb34b1c5e58f09012c1db0126923c0
Michael Jeanson [Tue, 7 Jul 2020 19:16:19 +0000 (15:16 -0400)]
Add tests/unit/test_payload to .gitignore
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic792326a78f6eac8d12682dc23642290f14289ff
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Francis Deslauriers [Thu, 9 Jul 2020 20:44:32 +0000 (16:44 -0400)]
Tests: live/test_{lttng_,}kernel: use lttng_test_filter_event instead of sched_switch
Background
==========
These tests currently rely on system load (the `sched_switch` event) to
generate trace data.
Issue
=====
This can be problematic for the `test_kernel`
test case because it has a fixed-size buffer to store the trace:
#define mmap_size 524288
This caused this test failure to randomly happen on my machine:
ok 7 - Get one index per stream
# mmap_size not big enough
not ok 8 - Get one data packet for stream 0, offset 0, len 4096
# Failed test (live_test.c:main() at line 709)
[error] Error detaching viewer session
not ok 9 - Detach viewer session
# Failed test (live_test.c:main() at line 715)
Solution
========
Use the `lttng_test_filter_event` event to control the size and
number of the events expected in the trace rather than depending on how
many Electron apps are currently fighting for my CPUs.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I15d500d5becf9c5e526ae11ff0b2a2f4f6b753ac
Francis Deslauriers [Thu, 9 Jul 2020 20:30:14 +0000 (16:30 -0400)]
Cleanup: Tests: live/test_{lttng_,}ust: testapp not in background
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib1f8046acf219f4f070419dc29059d187844976a
Francis Deslauriers [Thu, 9 Jul 2020 21:38:17 +0000 (17:38 -0400)]
Fix: consumer: Move sanity check within `consumer_subbuffer` functions
The sanity check on the number of bytes written by the
`consumer_subbuffer` callback was not correct for channels configured
with the splice output type in a live session.
To simplify this, move the checks into the callbacks themselves so they
can be specialized.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4e47a305860684c461ba7ffffd5e3bb3a21990b0
Francis Deslauriers [Tue, 7 Jul 2020 16:27:34 +0000 (12:27 -0400)]
Cleanup: typo in DBG() statements
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I19518292b4e7bd9319cafa05f08c1854f7693168
Jonathan Rajotte [Tue, 21 Jul 2020 15:00:40 +0000 (11:00 -0400)]
Fix: use sys/types.h for ssize_t on Cygwin
Observed issue
==============
On cygwin worker:
In file included from snapshot.c:10:
../../src/common/snapshot.h:33:1: error: unknown type name `ssize_t`; did you mean `_ssize_t`?
33 | ssize_t lttng_snapshot_output_create_from_buffer(
| ^~~~~~~
| _ssize_t
snapshot.c:128:9: error: conflicting types for `lttng_snapshot_output_create_from_buffer`
128 | ssize_t lttng_snapshot_output_create_from_buffer(
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from snapshot.c:10:
../../src/common/snapshot.h:33:9: note: previous declaration of `lttng_snapshot_output_create_from_buffer` was here
33 | ssize_t lttng_snapshot_output_create_from_buffer(
|
Solution
========
Include sys/types.h.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1df58ffb6df02d6957e1e4eac6ebfb297f3e3bb0
Michael Jeanson [Mon, 6 Jul 2020 16:18:01 +0000 (12:18 -0400)]
Add kernel and UST time namespace context
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I37a67cd61d55e1b94a9f54712143daeabfb47e84
Jérémie Galarneau [Tue, 21 Jul 2020 16:51:48 +0000 (12:51 -0400)]
Fix: sessiond: wrong variable checked for error code
Coverity reports:
1430513 Logically dead code
The indicated dead code may have performed some action; that action
will never occur.
In ust_app_open_packets: Code can never be reached because of a
logical contradiction (CWE-561)
The `open_ret` variable should be checked instead of `ret` after calling
consumer_open_channel_packets.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I49ad2d2b747e29414fee7f0727df1feb343695bb
Jérémie Galarneau [Tue, 21 Jul 2020 16:44:59 +0000 (12:44 -0400)]
Fix: consumerd: double unlock on rotate channel error path
Coverity Scan reports:
1430541 Double unlock
May result in undefined behavior.
In lttng_consumer_rotate_channel: Attempt to release a non-recursive
lock that is not held (CWE-765)
The error path should jump to end_unlock_channel as the stream
lock is only held for the duration of the call to
consumer_stream_open_packet.
This bug was introduced by the previous commit.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2de684eece963973d0861e5924263575363484f6
Jérémie Galarneau [Mon, 20 Jul 2020 19:29:41 +0000 (15:29 -0400)]
Fix: consumerd: packet sent before channel rotation
Issue observed
==============
A clear test occasionally fails with the following output:
# Test ust streaming rotate-clear
# Parameters: tracing_active=0, clear_twice=1, rotate_before=0, rotate_after=0, buffer_type=uid
ok 605 - Create session S0BXcJKWrmAwNSzm with uri:net://localhost and opts:
PASS: tools/clear/test_ust 605 - Create session S0BXcJKWrmAwNSzm with uri:net://localhost and opts:
ok 606 - Enable channel chan for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 606 - Enable channel chan for session S0BXcJKWrmAwNSzm
ok 607 - Enable ust event tp:tptest for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 607 - Enable ust event tp:tptest for session S0BXcJKWrmAwNSzm
ok 608 - Start tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 608 - Start tracing for session S0BXcJKWrmAwNSzm
ok 609 - Rotate session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 609 - Rotate session S0BXcJKWrmAwNSzm
ok 610 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 610 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
ok 611 - Clear session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 611 - Clear session S0BXcJKWrmAwNSzm
ok 612 - Clear session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 612 - Clear session S0BXcJKWrmAwNSzm
Error: Relayd rotate streams replied error 97
Error: Relayd rotate stream failed. Cleaning up relayd 33
Error: Relayd send index failed. Cleaning up relayd 33.
Error: Rotate channel failed
Error: Stream 76 relayd ID 33 unknown. Can't write index.
Error: Stream 74 relayd ID 33 unknown. Can't write index.
Error: Stream 72 relayd ID 33 unknown. Can't write index.
ok 613 - Start tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 613 - Start tracing for session S0BXcJKWrmAwNSzm
not ok 614 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
FAIL: tools/clear/test_ust 614 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
# Failed test 'Stop lttng tracing for session S0BXcJKWrmAwNSzm'
# in ./tools/clear//../../../utils/utils.sh:stop_lttng_tracing_opt() at line 1311.
ok 615 - Validate trace for event tp:tptest, 1 events
PASS: tools/clear/test_ust 615 - Validate trace for event tp:tptest, 1 events
not ok 616 - Read a total of 1 events, expected 4
FAIL: tools/clear/test_ust 616 - Read a total of 1 events, expected 4
# Failed test 'Read a total of 1 events, expected 4'
# in ./tools/clear//../../../utils/utils.sh:validate_trace_count() at line 1764.
Error: Failed to perform an implicit rotation as part of the destruction of session "S0BXcJKWrmAwNSzm": Unknown error code
not ok 617 - Destroy session S0BXcJKWrmAwNSzm
FAIL: tools/clear/test_ust 617 - Destroy session S0BXcJKWrmAwNSzm
# Failed test 'Destroy session S0BXcJKWrmAwNSzm'
# in ./tools/clear//../../../utils/utils.sh:destroy_lttng_session() at line 1347.
# Test ust streaming clear-rotate
Looking at the relay daemon log when the problem is reproduced,
we see:
Error: Protocol error: received a packet for a stream that doesn't have a current trace chunk: stream_id = 1, channel_name = chan_0
Cause
=====
The "rotate channel" consumer command iterates over a channel's streams
to perform a rotation and open a new packet when necessary
(see comments).
In the case where a channel is associated with a relay daemon, the
rotation positions are accumulated to send a single "rotate channel
streams" command to the relay daemon. This is done to reduce the time
needed to complete a rotation when tracing to a relay daemon through a
high-latency network connection.
Unfortunately, this causes packets to be opened before the rotation
command was sent to the relay daemon as the "open packet" command
is performed during the iteration on the streams.
Solution
========
Streams for which a packet should be opened are accumulated into an
array of stream pointers. The "open packet" is performed after a
successful rotation of the streams as a second "pass".
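The two-pass approach can be sketched as follows. All names are
hypothetical: `sketch_stream` is a simplified stand-in for a consumer
stream, and the `rotate_streams` callback stands in for the single
"rotate channel streams" command sent to the relay daemon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_STREAMS 8

/* Hypothetical, simplified stream state. */
struct sketch_stream {
	int id;
	bool needs_open_packet;
	bool packet_opened;
};

/* Hypothetical stand-ins for the relay daemon rotation command. */
int sketch_rotate_ok(void) { return 0; }
int sketch_rotate_fail(void) { return -1; }

int sketch_rotate_channel(struct sketch_stream *streams, size_t count,
		int (*rotate_streams)(void))
{
	struct sketch_stream *to_open[SKETCH_MAX_STREAMS];
	size_t to_open_count = 0, i;

	if (count > SKETCH_MAX_STREAMS) {
		return -1;
	}

	/* First pass: only remember which streams need a new packet. */
	for (i = 0; i < count; i++) {
		if (streams[i].needs_open_packet) {
			to_open[to_open_count++] = &streams[i];
		}
	}

	/* The rotation must succeed before any packet is opened. */
	if (rotate_streams() != 0) {
		return -1;
	}

	/* Second pass: open packets now that the rotation is done. */
	for (i = 0; i < to_open_count; i++) {
		to_open[i]->packet_opened = true;
	}

	return 0;
}
```

If the rotation command fails, no packet is opened, so no packet can
reach the relay daemon ahead of the rotation.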
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6cb0b92619da73ebf223e1dfee88530e4244b56b
Jérémie Galarneau [Mon, 20 Jul 2020 15:48:23 +0000 (11:48 -0400)]
Clean-up: relayd: missing space in debug statement
Missing space in rotation debug statement results in logs of the form:
[...] Rotate stream 1at sequence number 4 [...]
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7f950224f52164d30f896d0b2bcc59304f842cf1
Jérémie Galarneau [Fri, 17 Jul 2020 18:26:17 +0000 (14:26 -0400)]
Fix: relayd: wrong specifier used in DBG format string
`len` is of type uint64_t while the format string specifies a size of
`%zd`. This results in a warning on most 32-bit architectures.
In file included from ../../../src/common/common.h:12:0,
from live.c:33:
live.c: In function `viewer_get_metadata`:
../../../src/common/error.h:161:35: warning: format `%zd` expects
argument of type `signed size_t`, but argument 6 has type `uint64_t
{aka long long unsigned int}` [-Wformat=]
#define DBG(fmt, args...) _ERRMSG("DEBUG1", PRINT_DBG, fmt, ## args)
^
../../../src/common/error.h:136:51: note: in definition of macro `__lttng_print`
fprintf((type) == PRINT_MSG ? stdout : stderr, fmt, ## args); \
^~~
../../../src/common/error.h:161:27: note: in expansion of macro `_ERRMSG`
#define DBG(fmt, args...) _ERRMSG("DEBUG1", PRINT_DBG, fmt, ## args)
^~~~~~~
live.c:2051:4: note: in expansion of macro `DBG`
DBG("Failed to read metadata: requested = %zd, got = %zd",
^~~
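A portable way to print a uint64_t is the PRIu64 macro from
<inttypes.h>. The sketch below assumes the second value is an ssize_t
and uses a hypothetical helper name; it is not the exact statement from
live.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: formats the message into `out`, returns 0 on success. */
int sketch_format_read_failure(char *out, size_t out_len,
		uint64_t requested, ssize_t got)
{
	/* PRIu64 expands to the right conversion on 32-bit and 64-bit targets. */
	const int ret = snprintf(out, out_len,
			"Failed to read metadata: requested = %" PRIu64 ", got = %zd",
			requested, got);

	return (ret < 0 || (size_t) ret >= out_len) ? -1 : 0;
}
```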
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I108191e0a94e712b0596d0d52a21cf8c971d6c1e
Jérémie Galarneau [Thu, 16 Jul 2020 21:17:08 +0000 (17:17 -0400)]
Tests: add a "new metadata after clear" test
Add a test that validates that the relay daemon performs a
metadata stream rotation after a clear.
The test enables a single event and launches a test application to
trigger it 10 times. Once the 10 event occurrences have been seen
by the live client, the session is cleared.
Then, two new events (statedump start and end) are enabled. This causes
new event descriptions to be appended to the metadata stream.
After a clear, the relay daemon should rotate the metadata stream
and send the new contents to the client. The client will then
be able to decode the statedump start/end events.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie6057577e3ee8a7c3ed1a84941308d10d372c16c
Jérémie Galarneau [Thu, 16 Jul 2020 16:39:20 +0000 (12:39 -0400)]
Fix: relayd: send_viewer_streams sends stack data in padding
A single stack-allocated instance of `struct lttng_viewer_stream` is
used to send the various streams to the live viewer. This structure
contains a path and channel name which remain uninitialized beyond the
null terminator.
The structure is zeroed on every iteration.
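A minimal sketch of the pattern, using a hypothetical structure with
much smaller fields than `struct lttng_viewer_stream`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the structure sent on the wire. */
struct sketch_viewer_stream {
	char path_name[32];
	char channel_name[16];
};

void sketch_fill_stream(struct sketch_viewer_stream *out,
		const char *path, const char *channel)
{
	/*
	 * Zero the whole structure first so the bytes that follow the null
	 * terminators never carry stale stack contents.
	 */
	memset(out, 0, sizeof(*out));
	snprintf(out->path_name, sizeof(out->path_name), "%s", path);
	snprintf(out->channel_name, sizeof(out->channel_name), "%s", channel);
}
```

Calling such a helper on every loop iteration also prevents a long name
from a previous stream from showing through after a shorter one.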
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1c60ee18e01e8f56157f0fe44bd3bd64b1194703
Jérémie Galarneau [Wed, 8 Jul 2020 18:57:40 +0000 (14:57 -0400)]
Clean-up: consumer: move open packet to post_consume
Move the "open packet" step of read_subbuffer to a post-consume callback
as this only needs to be done for data streams; it does not belong in
the core of the read_subbuffer template method.
Change-Id: Ia4d3f8f833e213a8d0e39bcf5ec766c2c05bcf80
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 26 Jun 2020 22:40:12 +0000 (18:40 -0400)]
Fix: stream intersection fails on snapshot of cleared session
Observed issue
==============
In the following scenario:
lttng create --snapshot
lttng enable-event -u -a
lttng start
taskset -c 0 <tracepoint producing app>
lttng clear
taskset -c 0 <tracepoint producing app>
lttng snapshot record
lttng destroy
When using the stream-intersection mode, babeltrace complains that the
time range for the intersection is invalid since the begin timestamp is
after the end timestamp.
This is caused by the presence of "inactive" streams for which no events
are recorded between the clear action and the recording of the snapshot.
These streams have a begin timestamp roughly equal to the moment when
the snapshot was taken (i.e. the end timestamp). Babeltrace, in
stream-intersection mode, attempts to use the latest beginning timestamp
of all streams as the start of the intersection and the earliest end
timestamp as the end boundary.
Path                                                        │File size │Packets │Timestamp: beginning         │Timestamp: end               │
snapshot-1-20200622-212617-1/ust/uid/1000/64-bit/channel0_0 │ 4.000 KiB│       1│2020-06-22 21:26:01.903685878│2020-06-22 21:26:17.630456312│
snapshot-1-20200622-212617-1/ust/uid/1000/64-bit/channel0_1 │ 4.000 KiB│       1│2020-06-22 21:26:17.630909310│2020-06-22 21:26:17.630909310│
snapshot-1-20200622-212617-1/ust/uid/1000/64-bit/channel0_2 │ 4.000 KiB│       1│2020-06-22 21:26:17.631295033│2020-06-22 21:26:17.631295033│
snapshot-1-20200622-212617-1/ust/uid/1000/64-bit/channel0_3 │ 4.000 KiB│       1│2020-06-22 21:26:17.631673614│2020-06-22 21:26:17.631673614│
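The intersection computation described above can be sketched as follows.
This is an illustration with hypothetical names, not Babeltrace's
implementation: the start is the latest beginning timestamp, the end is
the earliest end timestamp, and the range is invalid when the start
falls after the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical time range of one stream, in nanoseconds. */
struct sketch_range {
	uint64_t begin_ns;
	uint64_t end_ns;
};

/* Returns true and fills `out` only if the streams share a valid range. */
bool sketch_stream_intersection(const struct sketch_range *streams,
		size_t count, struct sketch_range *out)
{
	size_t i;

	if (count == 0) {
		return false;
	}

	*out = streams[0];
	for (i = 1; i < count; i++) {
		if (streams[i].begin_ns > out->begin_ns) {
			out->begin_ns = streams[i].begin_ns; /* Latest beginning. */
		}
		if (streams[i].end_ns < out->end_ns) {
			out->end_ns = streams[i].end_ns; /* Earliest end. */
		}
	}

	return out->begin_ns <= out->end_ns;
}
```

Applied to the four streams listed above, the latest beginning
(21:26:17.631673614) comes after the earliest end (21:26:17.630456312),
so the range is rejected.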
Cause
=====
The packet beginning timestamps of the buffers are initialized on
creation (on the first "start" of a tracing session). When a "clear" is
performed on a session, all open packets are closed and the existing
contents are purged.
If a stream is inactive, it is possible for no packet to be "opened"
until a snapshot of the tracing session is recorded.
Solution
========
A new consumer command, "open channel packets" is added. This command
performs a "flush empty" operation on all streams of a channel.
This command is invoked after a clear (after the tracing is re-started)
and on start. This ensures that streams are opened as soon as possible
after a clear, a rotation, or a session start.
Known drawbacks
===============
In the case of an inactive stream, this results in an extra empty packet at
the beginning of the inactive streams (typically 4kB) in the snapshots.
In the case of an active stream, this change will cause the first packet
to be empty or contain few events. If the stream is active enough to
wrap-around, that empty packet will simply be overwritten.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I19b5c630fa8bddfb13c3c10f86c6cc9dc4990b08
Jérémie Galarneau [Mon, 13 Jul 2020 22:42:22 +0000 (18:42 -0400)]
Fix: relayd: viewer metadata is not rotated after a session clear
Issue observed
==============
Following a session clear, babeltrace sometimes doesn't receive
the content of the metadata that is announced in the get_metadata
reply header.
This causes babeltrace2 to assert (a fix has been submitted) since
the protocol state becomes de-synchronized, causing babeltrace to
interpret follow-up replies as garbage.
This was occasionally observed on the CI when running the "clear" tests.
Cause
=====
There are no provisions made for rotating a viewer metadata stream when
a clear is performed on a live session. This means that a client can
request metadata that is only present in a newer chunk.
In the case of the crashes observed on the CI, the relay daemon attempts
to service a get_new_metadata request of size `4096`. It then fails to
read the data (as it was never present in the original trace chunk).
The relay daemon does not interpret the `0` returned by lttng_read() as
an error and sends a reply announcing `4096` bytes of metadata and no
payload.
Solution
========
Two fixes are rolled into this patch.
First, the return of lttng_read is checked for `0` and that situation is
handled as an error. However, this still leaves the problem of the
metadata stream not being rotated.
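
As an illustration of the first fix, here is a minimal sketch of a read
loop that treats a return of `0` as an error. It does not use the real
lttng_read() signature: `struct mem_source`, `mem_read()` and
`read_announced_metadata()` are hypothetical names introduced only for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical in-memory source standing in for a metadata file. */
struct mem_source {
	const char *data;
	size_t size;
	size_t pos;
};

/* read()-like helper: returns the number of bytes copied, 0 at end of data. */
static ssize_t mem_read(struct mem_source *src, void *buf, size_t len)
{
	size_t avail, count;

	assert(src && buf);
	avail = src->size - src->pos;
	count = len < avail ? len : avail;
	memcpy(buf, src->data + src->pos, count);
	src->pos += count;
	return (ssize_t) count;
}

/*
 * Read exactly `announced_len` bytes. Returns 0 on success, -1 on error.
 * A read of 0 bytes means the announced data is not present in the file,
 * so it is reported as an error instead of being silently accepted.
 */
static int read_announced_metadata(struct mem_source *src, char *buf,
		size_t announced_len)
{
	size_t done = 0;

	while (done < announced_len) {
		const ssize_t ret = mem_read(src, buf + done,
				announced_len - done);

		if (ret <= 0) {
			/* ret == 0: unexpected end of data; ret < 0: I/O error. */
			return -1;
		}
		done += (size_t) ret;
	}
	return 0;
}
```

With this check, a caller never sends a reply header announcing a payload
it was unable to read.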
Secondly, the metadata relay_stream is checked for a rotation on every
`get_metadata` command. If a rotation has been detected, a viewer
rotation is performed on the metadata stream (very similar to the data
stream).
This solves the problem, but it leaves a case which the protocol does
not account for.
Essentially, the following can take place:
- relayd sets the "NEW_METADATA" flag as part of a `get_next_index`
query reply
- A rotation of the metadata stream occurs, no data is sent.
- client requests metadata
- metadata sent > received (was reset to 0 as part of the rotation)
In this scenario, the current implementation returned NO_NEW_METADATA,
which is erroneous. Returning this status guarantees to the viewer that it
will be able to decode all data packets that follow (until new metadata
is signalled, if ever).
Ideally, we would return a `RETRY` code, as is done by the data stream
handler when it detects that a rotation is taking place. Unfortunately,
such a code doesn't exist for the `get_metadata` command.
We return `OK` with a length of 0, which is technically correct since
viewers are supposed to fetch metadata until the relay daemon returns
the `NO_NEW_METADATA` status code. However, supporting this has required
changes to babeltrace2's lttng-live source component.
I'm anticipating that most implementations don't handle the 0-length
case any better.
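
For viewer implementers, the expected behaviour can be sketched as a
fetch loop that accepts `OK` replies of length 0 and only stops on
`NO_NEW_METADATA`. This is a simplified model, not the lttng-live
protocol code: the enum, the `scripted_relay` structure and the function
names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; the real protocol headers use other names. */
enum metadata_status {
	METADATA_OK,
	METADATA_NO_NEW,
	METADATA_ERR,
};

struct metadata_reply {
	enum metadata_status status;
	size_t len;
};

/* Hypothetical relay daemon that plays back a fixed list of replies. */
struct scripted_relay {
	const struct metadata_reply *replies;
	size_t count;
	size_t next;
};

static struct metadata_reply get_metadata(struct scripted_relay *relay)
{
	const struct metadata_reply err = { METADATA_ERR, 0 };

	assert(relay);
	return relay->next < relay->count ? relay->replies[relay->next++] : err;
}

/*
 * Fetch metadata until the relay daemon reports that none is left.
 * Returns the number of bytes received, or -1 on error.
 */
static long fetch_all_metadata(struct scripted_relay *relay)
{
	long total = 0;

	for (;;) {
		const struct metadata_reply reply = get_metadata(relay);

		switch (reply.status) {
		case METADATA_OK:
			/* A length of 0 is valid: nothing to append, ask again. */
			total += (long) reply.len;
			break;
		case METADATA_NO_NEW:
			return total;
		default:
			return -1;
		}
	}
}
```

A real viewer would likely wait between requests after a 0-length reply
rather than retry immediately; that is left out of the sketch.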
Known drawbacks
===============
Older viewers may not handle `OK` replies with a length of 0 gracefully.
Sending `NO_NEW_METADATA` is not an option as it breaks the guarantee
that all necessary metadata will be received before `NO_NEW_METADATA`.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I99cc1cea15125b48f5ee4e407d782b34b5e3dfb5
Jérémie Galarneau [Thu, 25 Jun 2020 23:22:24 +0000 (19:22 -0400)]
Fix: post-clear trace chunk has a late beginning packet
Observed issue
==============
In the following scenario:
- create a regular session
- enable an event that occurs on only one core
- start tracing
- trace for a while
- stop tracing
- clear the session
- start
- trace for a while (referred to as the "active period" later on)
- stop
The resulting trace will contain a very short stream intersection as the
active stream will contain packets spanning the entire active period.
However, all other streams will contain a single packet at the end of
the active period with a duration of 0 ns.
This presents two problems:
1) This makes the stream intersection mode of viewers unusable,
2) This misleads the user into thinking that tracing was not active
for some buffers.
Cause
=====
The packet beginning timestamps of the buffers are initialized on
creation (on the first "start" of a tracing session). When a "clear" is
performed on a session, all open packets are closed and the existing
contents are purged.
If a stream is inactive, it is possible for no packet to be "opened"
until the "stop" of the tracing session.
On stop, a "flush_empty" is performed. Such a flush opens a packet
(if it was not already the case), closes it, and marks the packet as
being ready for consumption.
Solution
========
Attempt to flush an empty packet as close to the rotation point as
possible. In the event where a stream remains inactive after the
rotation point, this ensures that the new trace chunk has a beginning
timestamp set at the beginning of the trace chunk instead of only
creating an empty packet when the trace chunk is stopped.
This indicates to the viewers that the stream was being recorded, but
more importantly it allows viewers to determine a usable trace
intersection.
This presents a problem in the case where the ring-buffer is completely
full.
Consider the following scenario:
- The consumption of data is slow (slow network, for instance),
- The ring buffer is full,
- A rotation is initiated,
- The flush below does nothing (no space left to open a new packet),
- The other streams rotate very soon, and new data is produced in the
new chunk,
- This stream completes its rotation long after the rotation was
initiated
- The session is stopped before any event can be produced in this
stream's buffers.
The resulting trace chunk will have a single packet temporally at the end
of the trace chunk for this stream, making the stream intersection
narrower than it should be.
To work around this, an empty flush is performed after the first
consumption of a packet during a rotation if the initial flush failed.
The idea is that consuming a packet frees enough space to switch packets
in this scenario and allows the tracer to "stamp" the beginning of the
new trace chunk at the earliest possible point.
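
The retry logic described above can be modelled as follows. This is a
toy model, assuming that opening a packet needs one free sub-buffer; the
`stream_model` structure and all function names are hypothetical and do
not correspond to the consumer daemon's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a stream's ring buffer, counted in sub-buffers. */
struct stream_model {
	unsigned int free_subbuffers;
	/* The new trace chunk has been "stamped" with a beginning packet. */
	bool packet_opened;
	/* The initial flush failed; retry after the next consumption. */
	bool open_packet_pending;
};

/* Try to flush an empty packet; fails when no sub-buffer is free. */
static bool try_open_packet(struct stream_model *stream)
{
	assert(stream);
	if (stream->free_subbuffers == 0) {
		return false;
	}
	stream->free_subbuffers--;
	stream->packet_opened = true;
	return true;
}

/* At the rotation point: attempt the flush, remember if it failed. */
static void on_rotation(struct stream_model *stream)
{
	stream->packet_opened = false;
	stream->open_packet_pending = !try_open_packet(stream);
}

/* After a packet is consumed: space was freed, retry a failed flush. */
static void on_packet_consumed(struct stream_model *stream)
{
	stream->free_subbuffers++;
	if (stream->open_packet_pending && try_open_packet(stream)) {
		stream->open_packet_pending = false;
	}
}
```

In the full-buffer scenario, `on_rotation()` leaves the flush pending and
the first call to `on_packet_consumed()` completes it.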
Note that metadata streams are always skipped when opening a packet.
This is done for two reasons:
1) Timestamps are not relevant to the metadata stream
2) Inserting an empty packet in the metadata stream after a rotation
breaks the use of "clear" in live mode.
The contents of the metadata streams of successive chunks must be
strict supersets of one another as live clients only receive the
information appended to a metadata stream (i.e. the parts it
already has received can't change).
If a flush_empty was performed after a clear/rotation, it would
result in an empty packet being inserted at the beginning of the
metadata stream that wasn't present in the first chunk.
This would cause the live client and relay daemon to have
mismatching copies of the metadata stream.
Known drawbacks
===============
In the case of an inactive stream, this results in the completed trace
chunk archive containing an extra empty packet at the beginning of the
stream (typically 4kB).
In the case of an active stream, this change will cause the first packet
to be empty or contain few events.
Those are all efficiency losses that are inevitable (AFAIK) given the
current buffer control APIs. It will be possible to recoup those losses
if an API allowing the consumer daemon to open a new packet is
introduced.
As noted in the comments, this patch is not final. The flush after the
rotation should not be open-coded in lttng_consumer_read_subbuffer. It
should be a data-stream specific "post-consume" step.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7d8ab876e55e9d0718a55ec1bb77ec6466accc02
Jérémie Galarneau [Tue, 7 Jul 2020 22:55:19 +0000 (18:55 -0400)]
Fix: kconsumer: missing wait for metadata thread in do_sync_metadata
The `do_sync_metadata` function is invoked every time a data sub-buffer
is consumed in live mode.
In the user space case, lttng_ustconsumer_sync_metadata() returns
EAGAIN (positive) when there is new metadata to consume. This causes the
"metadata rendez-vous" synchronization to take place. However, the
kernel variant of this function returns 0 when there is new data to
consume, causing the "rendez-vous" to be skipped.
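
The mismatch can be reduced to the following sketch. It assumes that the
common code only performs the rendez-vous when it sees EAGAIN; the three
functions are hypothetical stand-ins, not the actual consumer code, and
only model the "new metadata available" case described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the user space variant when new metadata is available. */
static int ust_new_metadata_ret(void)
{
	return EAGAIN;
}

/* Stand-in for the kernel variant when new metadata is available. */
static int kernel_new_metadata_ret(void)
{
	return 0;
}

/* Assumed shape of the common check: only EAGAIN triggers the rendez-vous. */
static bool rendezvous_taken(int sync_metadata_ret)
{
	/* Error handling (negative values) is out of scope for this sketch. */
	assert(sync_metadata_ret >= 0);
	return sync_metadata_ret == EAGAIN;
}
```

Under these assumptions, `rendezvous_taken(ust_new_metadata_ret())` is
true while `rendezvous_taken(kernel_new_metadata_ret())` is false, even
though both domains have new metadata to consume.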
I have not observed an issue caused by this first-hand, but the check
appears bogus and skips over an essential synchronization step.
This check has been in place since at least 2013, although the callees
and their return values may have changed at some point in the past.
Solution
--------
The user space and kernel code paths mix various return code conventions
(negative errno, positive errno, 0/-1) which makes it difficult to
understand the final return codes and most likely led to this confusion
in the first place.
Moreover, returning EAGAIN to indicate that data is ready to be consumed
is not appropriate in view of the existing conventions in the code base.
An explicit `enum sync_metadata_status` is returned by the domains'
sync_metadata operations which allows the common code to handle the
various conditions in a straightforward manner and for the
"rendez-vous" to take place in the kernel case.
Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib022eee97054c0b376853dd05593e3b94bc9a8ca