Lesson 35 - Get Compute Auth Token Working

2026-02-28 12:32:28 -05:00
parent 1d477ee42a
commit 4fde462bce
7743 changed files with 1397833 additions and 18 deletions
--- a/Plugins/GameLiftPlugin/Source/GameLiftServer/ThirdParty/concurrentqueue/README.md
+++ b/Plugins/GameLiftPlugin/Source/GameLiftServer/ThirdParty/concurrentqueue/README.md
@@ -0,0 +1,533 @@
+# moodycamel::ConcurrentQueue<T>
+
+An industrial-strength lock-free queue for C++.
+
+Note: If all you need is a single-producer, single-consumer queue, I have [one of those too][spsc].
+
+## Features
+
+- Knock-your-socks-off [blazing fast performance][benchmarks].
+- Single-header implementation. Just drop it in your project.
+- Fully thread-safe lock-free queue. Use concurrently from any number of threads.
+- C++11 implementation -- elements are moved (instead of copied) where possible.
+- Templated, obviating the need to deal exclusively with pointers -- memory is managed for you.
+- No artificial limitations on element types or maximum count.
+- Memory can be allocated once up-front, or dynamically as needed.
+- Fully portable (no assembly; all is done through standard C++11 primitives).
+- Supports super-fast bulk operations.
+- Includes a low-overhead blocking version (BlockingConcurrentQueue).
+- Exception safe.
+
+## Reasons to use
+
+There are not that many full-fledged lock-free queues for C++. Boost has one, but it's limited to objects with trivial
+assignment operators and trivial destructors, for example.
+Intel's TBB queue isn't lock-free, and requires trivial constructors too.
+There're many academic papers that implement lock-free queues in C++, but usable source code is
+hard to find, and tests even more so.
+
+This queue not only has less limitations than others (for the most part), but [it's also faster][benchmarks].
+It's been fairly well-tested, and offers advanced features like **bulk enqueueing/dequeueing**
+(which, with my new design, is much faster than one element at a time, approaching and even surpassing
+the speed of a non-concurrent queue even under heavy contention).
+
+In short, there was a lock-free queue shaped hole in the C++ open-source universe, and I set out
+to fill it with the fastest, most complete, and well-tested design and implementation I could.
+The result is `moodycamel::ConcurrentQueue` :-)
+
+## Reasons *not* to use
+
+The fastest synchronization of all is the kind that never takes place. Fundamentally,
+concurrent data structures require some synchronization, and that takes time. Every effort
+was made, of course, to minimize the overhead, but if you can avoid sharing data between
+threads, do so!
+
+Why use concurrent data structures at all, then? Because they're gosh darn convenient! (And, indeed,
+sometimes sharing data concurrently is unavoidable.)
+
+My queue is **not linearizable** (see the next section on high-level design). The foundations of
+its design assume that producers are independent; if this is not the case, and your producers
+co-ordinate amongst themselves in some fashion, be aware that the elements won't necessarily
+come out of the queue in the same order they were put in *relative to the ordering formed by that co-ordination*
+(but they will still come out in the order they were put in by any *individual* producer). If this affects
+your use case, you may be better off with another implementation; either way, it's an important limitation
+to be aware of.
+
+My queue is also **not NUMA aware**, and does a lot of memory re-use internally, meaning it probably doesn't
+scale particularly well on NUMA architectures; however, I don't know of any other lock-free queue that *is*
+NUMA aware (except for [SALSA][salsa], which is very cool, but has no publicly available implementation that I know of).
+
+Finally, the queue is **not sequentially consistent**; there *is* a happens-before relationship between when an element is put
+in the queue and when it comes out, but other things (such as pumping the queue until it's empty) require more thought
+to get right in all eventualities, because explicit memory ordering may have to be done to get the desired effect. In other words,
+it can sometimes be difficult to use the queue correctly. This is why it's a good idea to follow the [samples][samples.md] where possible.
+On the other hand, the upside of this lack of sequential consistency is better performance.
+
+## High-level design
+
+Elements are stored internally using contiguous blocks instead of linked lists for better performance.
+The queue is made up of a collection of sub-queues, one for each producer. When a consumer
+wants to dequeue an element, it checks all the sub-queues until it finds one that's not empty.
+All of this is largely transparent to the user of the queue, however -- it mostly just works<sup>TM</sup>.
+
+One particular consequence of this design, however, (which seems to be non-intuitive) is that if two producers
+enqueue at the same time, there is no defined ordering between the elements when they're later dequeued.
+Normally this is fine, because even with a fully linearizable queue there'd be a race between the producer
+threads and so you couldn't rely on the ordering anyway. However, if for some reason you do extra explicit synchronization
+between the two producer threads yourself, thus defining a total order between enqueue operations, you might expect
+that the elements would come out in the same total order, which is a guarantee my queue does not offer. At that
+point, though, there semantically aren't really two separate producers, but rather one that happens to be spread
+across multiple threads. In this case, you can still establish a total ordering with my queue by creating
+a single producer token, and using that from both threads to enqueue (taking care to synchronize access to the token,
+of course, but there was already extra synchronization involved anyway).
+
+I've written a more detailed [overview of the internal design][blog], as well as [the full
+nitty-gritty details of the design][design], on my blog. Finally, the
+[source][source] itself is available for perusal for those interested in its implementation.
+
+## Basic use
+
+The entire queue's implementation is contained in **one header**, [`concurrentqueue.h`][concurrentqueue.h].
+Simply download and include that to use the queue. The blocking version is in a separate header,
+[`blockingconcurrentqueue.h`][blockingconcurrentqueue.h], that depends on [`concurrentqueue.h`][concurrentqueue.h] and
+[`lightweightsemaphore.h`][lightweightsemaphore.h]. The implementation makes use of certain key C++11 features,
+so it requires a relatively recent compiler (e.g. VS2012+ or g++ 4.8; note that g++ 4.6 has a known bug with `std::atomic`
+and is thus not supported). The algorithm implementations themselves are platform independent.
+
+Use it like you would any other templated queue, with the exception that you can use
+it from many threads at once :-)
+
+Simple example:
+
+```C++
+#include "concurrentqueue.h"
+
+moodycamel::ConcurrentQueue<int> q;
+q.enqueue(25);
+
+int item;
+bool found = q.try_dequeue(item);
+assert(found && item == 25);
+```
+
+Description of basic methods:
+- `ConcurrentQueue(size_t initialSizeEstimate)`
+      Constructor which optionally accepts an estimate of the number of elements the queue will hold
+- `enqueue(T&& item)`
+      Enqueues one item, allocating extra space if necessary
+- `try_enqueue(T&& item)`
+      Enqueues one item, but only if enough memory is already allocated
+- `try_dequeue(T& item)`
+      Dequeues one item, returning true if an item was found or false if the queue appeared empty
+
+Note that it is up to the user to ensure that the queue object is completely constructed before
+being used by any other threads (this includes making the memory effects of construction
+visible, possibly via a memory barrier). Similarly, it's important that all threads have
+finished using the queue (and the memory effects have fully propagated) before it is
+destructed.
+
+There's usually two versions of each method, one "explicit" version that takes a user-allocated per-producer or
+per-consumer token, and one "implicit" version that works without tokens. Using the explicit methods is almost
+always faster (though not necessarily by a huge factor). Apart from performance, the primary distinction between them
+is their sub-queue allocation behaviour for enqueue operations: Using the implicit enqueue methods causes an
+automatically-allocated thread-local producer sub-queue to be allocated.
+Explicit producers, on the other hand, are tied directly to their tokens' lifetimes (but are recycled internally).
+
+In order to avoid the number of sub-queues growing without bound, implicit producers are marked for reuse once
+their thread exits. However, this is not supported on all platforms. If using the queue from short-lived threads,
+it is recommended to use explicit producer tokens instead.
+
+Full API (pseudocode):
+
+	# Allocates more memory if necessary
+	enqueue(item) : bool
+	enqueue(prod_token, item) : bool
+	enqueue_bulk(item_first, count) : bool
+	enqueue_bulk(prod_token, item_first, count) : bool
+	
+	# Fails if not enough memory to enqueue
+	try_enqueue(item) : bool
+	try_enqueue(prod_token, item) : bool
+	try_enqueue_bulk(item_first, count) : bool
+	try_enqueue_bulk(prod_token, item_first, count) : bool
+	
+	# Attempts to dequeue from the queue (never allocates)
+	try_dequeue(item&) : bool
+	try_dequeue(cons_token, item&) : bool
+	try_dequeue_bulk(item_first, max) : size_t
+	try_dequeue_bulk(cons_token, item_first, max) : size_t
+	
+	# If you happen to know which producer you want to dequeue from
+	try_dequeue_from_producer(prod_token, item&) : bool
+	try_dequeue_bulk_from_producer(prod_token, item_first, max) : size_t
+	
+	# A not-necessarily-accurate count of the total number of elements
+	size_approx() : size_t
+
+## Blocking version
+
+As mentioned above, a full blocking wrapper of the queue is provided that adds
+`wait_dequeue` and `wait_dequeue_bulk` methods in addition to the regular interface.
+This wrapper is extremely low-overhead, but slightly less fast than the non-blocking
+queue (due to the necessary bookkeeping involving a lightweight semaphore).
+
+There are also timed versions that allow a timeout to be specified (either in microseconds
+or with a `std::chrono` object).
+
+The only major caveat with the blocking version is that you must be careful not to
+destroy the queue while somebody is waiting on it. This generally means you need to
+know for certain that another element is going to come along before you call one of
+the blocking methods. (To be fair, the non-blocking version cannot be destroyed while
+in use either, but it can be easier to coordinate the cleanup.)
+
+Blocking example:
+
+```C++
+#include "blockingconcurrentqueue.h"
+
+moodycamel::BlockingConcurrentQueue<int> q;
+std::thread producer([&]() {
+    for (int i = 0; i != 100; ++i) {
+        std::this_thread::sleep_for(std::chrono::milliseconds(i % 10));
+        q.enqueue(i);
+    }
+});
+std::thread consumer([&]() {
+    for (int i = 0; i != 100; ++i) {
+        int item;
+        q.wait_dequeue(item);
+        assert(item == i);
+        
+        if (q.wait_dequeue_timed(item, std::chrono::milliseconds(5))) {
+            ++i;
+            assert(item == i);
+        }
+    }
+});
+producer.join();
+consumer.join();
+
+assert(q.size_approx() == 0);
+```
+
+## Advanced features
+
+#### Tokens
+
+The queue can take advantage of extra per-producer and per-consumer storage if
+it's available to speed up its operations. This takes the form of "tokens":
+You can create a consumer token and/or a producer token for each thread or task
+(tokens themselves are not thread-safe), and use the methods that accept a token
+as their first parameter:
+
+```C++
+moodycamel::ConcurrentQueue<int> q;
+
+moodycamel::ProducerToken ptok(q);
+q.enqueue(ptok, 17);
+
+moodycamel::ConsumerToken ctok(q);
+int item;
+q.try_dequeue(ctok, item);
+assert(item == 17);
+```
+
+If you happen to know which producer you want to consume from (e.g. in
+a single-producer, multi-consumer scenario), you can use the `try_dequeue_from_producer`
+methods, which accept a producer token instead of a consumer token, and cut some overhead.
+
+Note that tokens work with the blocking version of the queue too.
+
+When producing or consuming many elements, the most efficient way is to:
+
+1. Use the bulk methods of the queue with tokens
+2. Failing that, use the bulk methods without tokens
+3. Failing that, use the single-item methods with tokens
+4. Failing that, use the single-item methods without tokens
+
+Having said that, don't create tokens willy-nilly -- ideally there would be
+one token (of each kind) per thread. The queue will work with what it is
+given, but it performs best when used with tokens.
+
+Note that tokens aren't actually tied to any given thread; it's not technically
+required that they be local to the thread, only that they be used by a single
+producer/consumer at a time.
+
+#### Bulk operations
+
+Thanks to the [novel design][blog] of the queue, it's just as easy to enqueue/dequeue multiple
+items as it is to do one at a time. This means that overhead can be cut drastically for
+bulk operations. Example syntax:
+
+```C++
+moodycamel::ConcurrentQueue<int> q;
+
+int items[] = { 1, 2, 3, 4, 5 };
+q.enqueue_bulk(items, 5);
+
+int results[5];     // Could also be any iterator
+size_t count = q.try_dequeue_bulk(results, 5);
+for (size_t i = 0; i != count; ++i) {
+    assert(results[i] == items[i]);
+}
+```
+
+#### Preallocation (correctly using `try_enqueue`)
+
+`try_enqueue`, unlike just plain `enqueue`, will never allocate memory. If there's not enough room in the
+queue, it simply returns false. The key to using this method properly, then, is to ensure enough space is
+pre-allocated for your desired maximum element count.
+
+The constructor accepts a count of the number of elements that it should reserve space for. Because the
+queue works with blocks of elements, however, and not individual elements themselves, the value to pass
+in order to obtain an effective number of pre-allocated element slots is non-obvious.
+
+First, be aware that the count passed is rounded up to the next multiple of the block size. Note that the
+default block size is 32 (this can be changed via the traits). Second, once a slot in a block has been
+enqueued to, that slot cannot be re-used until the rest of the block has been completely filled
+up and then completely emptied. This affects the number of blocks you need in order to account for the
+overhead of partially-filled blocks. Third, each producer (whether implicit or explicit) claims and recycles
+blocks in a different manner, which again affects the number of blocks you need to account for a desired number of
+usable slots.
+
+Suppose you want the queue to be able to hold at least `N` elements at any given time. Without delving too
+deep into the rather arcane implementation details, here are some simple formulas for the number of elements
+to request for pre-allocation in such a case. Note the division is intended to be arithmetic division and not
+integer division (in order for `ceil()` to work).
+
+For explicit producers (using tokens to enqueue):
+
+```C++
+(ceil(N / BLOCK_SIZE) + 1) * MAX_NUM_PRODUCERS * BLOCK_SIZE
+```
+
+For implicit producers (no tokens):
+
+```C++
+(ceil(N / BLOCK_SIZE) - 1 + 2 * MAX_NUM_PRODUCERS) * BLOCK_SIZE
+```
+
+When using mixed producer types:
+
+```C++
+((ceil(N / BLOCK_SIZE) - 1) * (MAX_EXPLICIT_PRODUCERS + 1) + 2 * (MAX_IMPLICIT_PRODUCERS + MAX_EXPLICIT_PRODUCERS)) * BLOCK_SIZE
+```
+
+If these formulas seem rather inconvenient, you can use the constructor overload that accepts the minimum
+number of elements (`N`) and the maximum number of explicit and implicit producers directly, and let it do the
+computation for you.
+
+In addition to blocks, there are other internal data structures that require allocating memory if they need to resize (grow).
+If using `try_enqueue` exclusively, the initial sizes may be exceeded, causing subsequent `try_enqueue` operations to fail.
+Specifically, the `INITIAL_IMPLICIT_PRODUCER_HASH_SIZE` trait limits the number of implicit producers that can be active at once
+before the internal hash needs resizing. Along the same lines, the `IMPLICIT_INITIAL_INDEX_SIZE` trait limits the number of
+unconsumed elements that an implicit producer can insert before its internal hash needs resizing. Similarly, the
+`EXPLICIT_INITIAL_INDEX_SIZE` trait limits the number of unconsumed elements that an explicit producer can insert before its
+internal hash needs resizing. In order to avoid hitting these limits when using `try_enqueue`, it is crucial to adjust the
+initial sizes in the traits appropriately, in addition to sizing the number of blocks properly as outlined above.
+
+Finally, it's important to note that because the queue is only eventually consistent and takes advantage of
+weak memory ordering for speed, there's always a possibility that under contention `try_enqueue` will fail
+even if the queue is correctly pre-sized for the desired number of elements. (e.g. A given thread may think that
+the queue's full even when that's no longer the case.) So no matter what, you still need to handle the failure
+case (perhaps looping until it succeeds), unless you don't mind dropping elements.
+
+#### Exception safety
+
+The queue is exception safe, and will never become corrupted if used with a type that may throw exceptions.
+The queue itself never throws any exceptions (operations fail gracefully (return false) if memory allocation
+fails instead of throwing `std::bad_alloc`).
+
+It is important to note that the guarantees of exception safety only hold if the element type never throws
+from its destructor, and that any iterators passed into the queue (for bulk operations) never throw either.
+Note that in particular this means `std::back_inserter` iterators must be used with care, since the vector
+being inserted into may need to allocate and throw a `std::bad_alloc` exception from inside the iterator;
+so be sure to reserve enough capacity in the target container first if you do this.
+
+The guarantees are presently as follows:
+- Enqueue operations are rolled back completely if an exception is thrown from an element's constructor.
+  For bulk enqueue operations, this means that elements are copied instead of moved (in order to avoid
+  having only some objects moved in the event of an exception). Non-bulk enqueues always use
+  the move constructor if one is available.
+- If the assignment operator throws during a dequeue operation (both single and bulk), the element(s) are
+  considered dequeued regardless. In such a case, the dequeued elements are all properly destructed before
+  the exception is propagated, but there's no way to get the elements themselves back.
+- Any exception that is thrown is propagated up the call stack, at which point the queue is in a consistent
+  state.
+
+Note: If any of your type's copy constructors/move constructors/assignment operators don't throw, be sure
+to annotate them with `noexcept`; this will avoid the exception-checking overhead in the queue where possible
+(even with zero-cost exceptions, there's still a code size impact that has to be taken into account).
+
+#### Traits
+
+The queue also supports a traits template argument which defines various types, constants,
+and the memory allocation and deallocation functions that are to be used by the queue. The typical pattern
+to providing your own traits is to create a class that inherits from the default traits
+and override only the values you wish to change. Example:
+
+```C++
+struct MyTraits : public moodycamel::ConcurrentQueueDefaultTraits
+{
+	static const size_t BLOCK_SIZE = 256;		// Use bigger blocks
+};
+
+moodycamel::ConcurrentQueue<int, MyTraits> q;
+```
+
+#### How to dequeue types without calling the constructor
+
+The normal way to dequeue an item is to pass in an existing object by reference, which
+is then assigned to internally by the queue (using the move-assignment operator if possible).
+This can pose a problem for types that are
+expensive to construct or don't have a default constructor; fortunately, there is a simple
+workaround: Create a wrapper class that copies the memory contents of the object when it
+is assigned by the queue (a poor man's move, essentially). Note that this only works if
+the object contains no internal pointers. Example:
+
+```C++
+struct MyObjectMover {
+    inline void operator=(MyObject&& obj) {
+        std::memcpy(data, &obj, sizeof(MyObject));
+        
+        // TODO: Cleanup obj so that when it's destructed by the queue
+        // it doesn't corrupt the data of the object we just moved it into
+    }
+
+    inline MyObject& obj() { return *reinterpret_cast<MyObject*>(data); }
+
+private:
+    align(alignof(MyObject)) char data[sizeof(MyObject)];
+};
+```
+
+A less dodgy alternative, if moves are cheap but default construction is not, is to use a
+wrapper that defers construction until the object is assigned, enabling use of the move
+constructor:
+
+```C++
+struct MyObjectMover {
+    inline void operator=(MyObject&& x) {
+        new (data) MyObject(std::move(x));
+        created = true;
+    }
+
+    inline MyObject& obj() {
+        assert(created);
+        return *reinterpret_cast<MyObject*>(data);
+    }
+
+    ~MyObjectMover() {
+        if (created)
+            obj().~MyObject();
+    }
+
+private:
+    align(alignof(MyObject)) char data[sizeof(MyObject)];
+    bool created = false;
+};
+```
+
+## Samples
+
+There are some more detailed samples [here][samples.md]. The source of
+the [unit tests][unittest-src] and [benchmarks][benchmark-src] are available for reference as well.
+
+## Benchmarks
+
+See my blog post for some [benchmark results][benchmarks] (including versus `boost::lockfree::queue` and `tbb::concurrent_queue`),
+or run the benchmarks yourself (requires MinGW and certain GnuWin32 utilities to build on Windows, or a recent
+g++ on Linux):
+
+```Shell
+cd build
+make benchmarks
+bin/benchmarks
+```
+
+The short version of the benchmarks is that it's so fast (especially the bulk methods), that if you're actually
+using the queue to *do* anything, the queue won't be your bottleneck.
+
+## Tests (and bugs)
+
+I've written quite a few unit tests as well as a randomized long-running fuzz tester. I also ran the
+core queue algorithm through the [CDSChecker][cdschecker] C++11 memory model model checker. Some of the
+inner algorithms were tested separately using the [Relacy][relacy] model checker, and full integration
+tests were also performed with Relacy.
+I've tested
+on Linux (Fedora 19) and Windows (7), but only on x86 processors so far (Intel and AMD). The code was
+written to be platform-independent, however, and should work across all processors and OSes.
+
+Due to the complexity of the implementation and the difficult-to-test nature of lock-free code in general,
+there may still be bugs. If anyone is seeing buggy behaviour, I'd like to hear about it! (Especially if
+a unit test for it can be cooked up.) Just open an issue on GitHub.
+	
+## Using vcpkg
+You can download and install `moodycamel::ConcurrentQueue` using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:
+
+```Shell
+git clone https://github.com/Microsoft/vcpkg.git
+cd vcpkg
+./bootstrap-vcpkg.sh
+./vcpkg integrate install
+vcpkg install concurrentqueue
+```
+	
+The `moodycamel::ConcurrentQueue` port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.
+
+## License
+
+I'm releasing the source of this repository (with the exception of third-party code, i.e. the Boost queue
+(used in the benchmarks for comparison), Intel's TBB library (ditto), CDSChecker, Relacy, and Jeff Preshing's
+cross-platform semaphore, which all have their own licenses)
+under a simplified BSD license. I'm also dual-licensing under the Boost Software License.
+See the [LICENSE.md][license] file for more details.
+
+Note that lock-free programming is a patent minefield, and this code may very
+well violate a pending patent (I haven't looked), though it does not to my present knowledge.
+I did design and implement this queue from scratch.
+
+## Diving into the code
+
+If you're interested in the source code itself, it helps to have a rough idea of how it's laid out. This
+section attempts to describe that.
+
+The queue is formed of several basic parts (listed here in roughly the order they appear in the source). There's the
+helper functions (e.g. for rounding to a power of 2). There's the default traits of the queue, which contain the
+constants and malloc/free functions used by the queue. There's the producer and consumer tokens. Then there's the queue's
+public API itself, starting with the constructor, destructor, and swap/assignment methods. There's the public enqueue methods,
+which are all wrappers around a small set of private enqueue methods found later on. There's the dequeue methods, which are
+defined inline and are relatively straightforward.
+
+Then there's all the main internal data structures. First, there's a lock-free free list, used for recycling spent blocks (elements
+are enqueued to blocks internally). Then there's the block structure itself, which has two different ways of tracking whether
+it's fully emptied or not (remember, given two parallel consumers, there's no way to know which one will finish first) depending on where it's used.
+Then there's a small base class for the two types of internal SPMC producer queues (one for explicit producers that holds onto memory
+but attempts to be faster, and one for implicit ones which attempt to recycle more memory back into the parent but is a little slower).
+The explicit producer is defined first, then the implicit one. They both contain the same general four methods: One to enqueue, one to
+dequeue, one to enqueue in bulk, and one to dequeue in bulk. (Obviously they have constructors and destructors too, and helper methods.)
+The main difference between them is how the block handling is done (they both use the same blocks, but in different ways, and map indices
+to them in different ways).
+
+Finally, there's the miscellaneous internal methods: There's the ones that handle the initial block pool (populated when the queue is constructed),
+and an abstract block pool that comprises the initial pool and any blocks on the free list. There's ones that handle the producer list
+(a lock-free add-only linked list of all the producers in the system). There's ones that handle the implicit producer lookup table (which
+is really a sort of specialized TLS lookup). And then there's some helper methods for allocating and freeing objects, and the data members
+of the queue itself, followed lastly by the free-standing swap functions.
+
+
+[blog]: http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++
+[design]: http://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue
+[samples.md]: https://github.com/cameron314/concurrentqueue/blob/master/samples.md
+[source]: https://github.com/cameron314/concurrentqueue
+[concurrentqueue.h]: https://github.com/cameron314/concurrentqueue/blob/master/concurrentqueue.h
+[blockingconcurrentqueue.h]: https://github.com/cameron314/concurrentqueue/blob/master/blockingconcurrentqueue.h
+[lightweightsemaphore.h]: https://github.com/cameron314/concurrentqueue/blob/master/lightweightsemaphore.h
+[unittest-src]: https://github.com/cameron314/concurrentqueue/tree/master/tests/unittests
+[benchmarks]: http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++#benchmarks
+[benchmark-src]: https://github.com/cameron314/concurrentqueue/tree/master/benchmarks
+[license]: https://github.com/cameron314/concurrentqueue/blob/master/LICENSE.md
+[cdschecker]: http://demsky.eecs.uci.edu/c11modelchecker.html
+[relacy]: http://www.1024cores.net/home/relacy-race-detector
+[spsc]: https://github.com/cameron314/readerwriterqueue
+[salsa]: http://webee.technion.ac.il/~idish/ftp/spaa049-gidron.pdf