Security Cache Redesign

Last modified by Vincent Massol on 2024/11/19 16:15

Description

Goals

The goals of the redesign of the security cache are the following:

  • Do not fail the computation of a security access entry when the cache doesn't accept a part of it.
  • When a cache entry is dropped because the cache is full, do not drop other entries that depend on it. This requires storing the hierarchy of entries that is needed for the invalidation of a security access entry also outside of the cache.
  • If possible, reduce the need for in particular global locks to allow more concurrency.
  • If possible, do not change the public API of the security cache.

Risks

It will no longer be possible to explicitly control the total size of the cache as the hierarchical part would not be limited and not visible from the monitoring. In wikis with a large number of groups, the number of entries in this hierarchy might be significant. However, it is also currently not fully controllable. For example, a document that has no access rules is counting to the security cache size in the same way as a potential child of it that has 500 rules. Further, as the number of potential security access entries in the cache is basically the number of pages times the number of active logged in users, I believe that the number of security access entries that will still be in the Infinispan cache should vastly outnumber the size of the hierarchy that we do not keep in the cache, which is limited by the number of documents plus the number of active users and groups. Still, there might be corner-cases where this is not true.

Background

XWiki has a security cache that speeds up access control checks. For this, it caches the security rules of documents, the hierarchy of users and groups and the actual access entries of a specific user and reference in an Infinispan cache. The actual access entries are what should be used most frequently. The whole cache is organized as a tree-like parents/children structure (a directed acyclic graph to be precise). Users have groups as parent, documents are nested according to their hierarchy in the wiki and the actual access entries have the document and the user as parent. The primary goal of this is to be able to be able to invalidate everything that depends on, e.g., a group membership of a user when that group membership is changed. For this purpose, when an item is removed from the cache, everything that depends on it is removed, too. This hierarchical information is also used to be able to quickly compute access entries when they are missing.

Now, unfortunately, there seem to be problems when this security cache becomes full. Basically, what seems to happen is that entries that are on the higher levels of the hierarchy are removed while we try inserting items that depend on them. Then, the insertion fails. At the moment, this can lead to errors when this happens repeatedly, as documented in XWIKI-18508 1. We can work around this by not relying on the cache insertion to succeed. However, this might lead to bad performance.

Infinispan doesn’t evict entries randomly but uses an algorithm named TinyLFU. For this, the cache maintains an approximation of the access frequency of recently accessed objects. Based on this approximation, it can also decide not to accept entries if the candidate that would need to be evicted has been more frequently accessed. This algorithm doesn’t take our hierarchy into account and I think this can lead to really bad decisions. When everything is working fine and basically the “first caching level” which consists of the actual access entries always gives a cache hit, we never access the items that are higher in the hierarchy. This can mean that the cache will decide to evict these items that are higher in the hierarchy and not allow their re-insertion as long as there are still other entries in the cache that have been more frequently accessed. It could therefore happen that, for example, a whole sub-wiki cannot be cached anymore for some time until we tried to insert these higher-level items so frequently that they are considered important again.

Computing Security Access Entries Without the Cache

There is absolutely no reason for not computing and returning a security access entry when the insertion into the cache fails. The code "just" needs to be restructured to not to fail when the insertion into the cache fails but to instead just continue the computation. This is an important first step that is assumed in the following Parts.

Design for the Hierarchy

  • Make children in security cache entries weak references. We might need to check if we can automatically remove disposed entries as discussed below (this seems possible in general, we just need to check if there is some data structure that is not too complex and provides this for us).
  • Store all entries of the security cache that are not access entries also in a (concurrent) hash map as weak references. This should be done in a way similar to https://github.com/hzulla/WeakValueHashMap such that values are automatically removed when they are garbage collected. We can use ReferenceMap from Apache Commons Collections as weak value map, it should support everything we need. It seems we don’t really need a thread-safe map as we already have read/write locks for cache accesses as other operations aren’t thread-safe, either. We might also be able to use a Caffeeine cache here that supports this behavior and would be thread-safe.
  • When a value is requested that is not an access entry, first check the cache, then check in this hash map/second cache. If the value is found in the map, insert it into the cache again (this might fail, though) and return it.
  • Make it possible for callers of the security cache to get a strong reference to the security cache entry such that callers can prevent entries from being evicted from this map/second cache while they are loading a hierarchy of entries. I haven’t looked at how to implement this. We should not hand out actual references to the security cache entries I think but we might hand out a class that holds a private reference (without any access options). Alternatively, there could also be a way to create some kind of “session” into which entries can be collected that won’t be discarded as long as the session is valid.
  • Do not remove children from the cache if an entry is evicted from the cache. Instead, the remove operation that is used for cache invalidation due to document/group changes needs to explicitly do a traversal of the tree to remove all entries from the cache. This should also explicitly include the map/second cache. Note that we already explicitly pause cache invalidation while loading values for a security access entry into the cache so this does not contradict with the guarantees given above.

Removing Global Locks

Invalidation Lock

The invalidation lock prevents cache invalidation while a security access entry is computed. This is to prevent that a cache entry is based on information that has been invalidated before its insertion (after the insertion, an invalidation request would just remove it). This is done by holding a read lock during the whole computation of the access entry (which can consist of several document loads) and requesting a write lock for invalidation. This can lead to deadlocks when custom code for loading security access rules writes a document. Further, it can lead to delayed document save operations. It seems thus beneficial if we could get rid of this lock. Here are some ideas for it:

Approach 1: Invalidation Counter

  • Have a 64bit atomic integer in the security cache that is increased at the end of every invalidation request, let's call this the invalidation counter.
  • When starting the computation of the security access entry, get the current value of the invalidation counter.
  • When inserting an entry (of any type) into the security cache, pass the value of the invalidation counter from the start of the computation. If the current invalidation counter is larger than the passed counter, the entry is not stored.

This requires the refactoring mentioned above to still return the security access entry when the insertion into the cache failed. This should be correct as no entry will be inserted into the cache whose computation is based on any information that was available before the end of the last invalidation request. Due to the write lock on the security cache, we know that any currently running invalidation completes before the insertion and thus it is not possible to insert an entry while the invalidation is still running and the invalidation counter hasn't been increased yet.

This could lead to more cache misses but at least in wikis without a large number of automatic document saves, cache invalidation due to document updates should be rare.

Approach 2: Atomic Insertion

The disadvantage of the invalidation counter is that it might prevent the insertion of completely valid and unaffected entries if there is any cache invalidation. Further, it requires the global write lock on the cache to work, removing that lock could be interesting, however.

The idea here is to associate every entry in the cache with a unique identifier that is guaranteed to be different after the entry has been disposed and re-inserted (this could be a unique 64-bit integer). When inserting an entry into the security cache, we would pass the identifiers of the parents with it and the cache would ensure that the parents that are in the cache have the passed identifiers and could thus detect when the parents (or any ancestors) have been invalidated and re-inserted in the meantime. There is still the problem that the entry itself could be invalidated while it is loaded. To avoid that, the security cache would need to support atomic insertion of entries. What I mean by this is that we would need a computeIfAbsent-like method that would guarantee that any dispose request that arrives after the start of the computation of the entry leads to an insertion failure.

Read/Write-Lock of the Cache

At the moment, the security cache has a read/write lock that protects all read/write operations. Most likely this doesn't have a big impact as no expensive operations should be performed under this lock. Still, it could be interesting to avoid it to allow even more concurrency. Instead, we could have locks on individual entries. However, it might be limited what we can do here due to the limited API of the actual cache. Apart from that, the main challenges are most likely:

  • Add children of an entry atomically
  • Avoid the insertion of duplicate entries and in the case of conflicting insertion, to make sure that the entry that wins the cache insertion is also added as a child but not the one that looses the cache insertion.
  • Correct invalidation in the case of concurrent insertion of to-be-invalidated entries requires careful design.

Overall, the risk of bugs might be larger than the benefits.


 

Get Connected