In the last months I finally completed and merged a long standing debt into Rapicorn. Ever since the Rapicorn GUI layout & rendering thread got separated from the main application (user) thread, referencing widgets (from the application via the C++ binding or the Python binding) worked mostly due to luck.
I investigated and researched several remote reference counting and distributed garbage collection schemes and many kudos go to Stefan Westerfeld for being able to bounce ideas off of him over time. In the end, the best solution for Rapicorn makes use of several unique features in its remote communication layer:
- Only the client (user) thread will ever make two way calls into the server (GUI) thread, i.e. send a function call message and block for a result.
- All objects are known to live in the server thread only.
- Remote messages/calls are strictly sequenced between the threads, i.e. messages will be delivered and processed sequentially and in order of arrival.
This allows the following scheme:
- Any object reference that gets passed from the server (GUI) thread into the client (user) thread enters a server-side reference-table to keep the object alive. I.e. the server thread assumes that clients automatically “ref” new objects that pass the thread boundary.
- For any object reference that’s received by a client thread, uses are counted separately on the client-side and once the first object becomes unused, a special message is sent back to the server thread (SEEN_GARBAGE).
- At any point after receiving SEEN_GARBAGE, the server thread may opt to initiate the collection of remote object references. The current code has no artificial delays built in and does so immediately (thresholds for delays may be added in the future).
- To collect references, the serer thread swaps its reference-table for an empty one and sends out a GARBAGE_SWEEP command.
- Upon receiving GARBAGE_SWEEP, the client thread creates a list of all object references it received in the past and for which the client-side use count has dropped to zero. These objects are removed from the client’s internal bookkeeping and the list is sent back to the server as GARBAGE_REPORT.
- Upon receiving a GARBAGE_REPORT, corresponding to a previous GARBAGE_SWEEP command, the sever thread has an exact list of references to purge from its previously detached reference-table. Remaining references are merged into currently active table (the one that started empty upon GARBAGE_SWEEP initiation). That way, all object references that have been sent to the client thread but are now unused are discarded, unless they have meanwhile been added into the newly active reference-table.
So far, the scheme works really well. Swapping out the server side reference-tables copes properly with the most tricky case: A (new) object reference traveling from the server to the client (e.g. as part of a get_object() call result), while the client is about to report this very reference as unused in a GARBAGE_REPORT. Such an object reference will be received by the client after its garbage report creation and treated as a genuinely new object reference arriving, similarly to the result of a create_object() call. On the server side it is simply added into the new reference table, so it’ll survive the server receiving the garbage report and subsequent garbage disposal.
The only thing left was figuring out how to automatically test that an object is collected/unreferenced, i.e. write code that checks that objects are gone…
Since we moved to std::shared_ptr for widget reference counting and often use std::make_shared(), there isn’t really any way too generically hook into the last unref of an object to install test code. The best effort test code I came up with can be found in testgc.py. It enables GC layer debug messages, triggers remote object creation + release and then checks the debugging output for corresponding collection messages. Example:
TestGC-Create 100 widgets... TestGC-Release 100 widgets... GCStats: ClientConnectionImpl: SEEN_GARBAGE (aaaa000400000004) GCStats: ServerConnectionImpl: GARBAGE_SWEEP: 103 candidates GCStats: ClientConnectionImpl: GARBAGE_REPORT: 100 trash ids GCStats: ServerConnectionImpl: GARBAGE_COLLECTED: \ considered=103 retained=3 purged=100 active=3
The protocol is fairly efficient in saving bandwitdh and task switches: ref-messages are implicit (never sent), unref-messages are sent only once and support batching (GARBAGE_REPORT). What’s left is the two tiny messages that initiate garbage collection (SEEN_GARBAGE) and synchronize reference counting (GARBAGE_SWEEP). As I hinted earlier, in the future we can introduce arbitrary delays between the two to reduce overhead and increase batching, if that ever becomes necessary.
While its not the most user visible functionality implemented in Rapicorn, it presents an important milestone for reliable toolkit operation and a fundament for future developments like remote calls across process or machine boundaries.