Wiki source code of Notifications Optimization

Last modified by Thomas Mortagne on 2020/02/07 13:46

Show last authors
1 Currently, we have a big problem with notifications performances. In many cases, users are forced to disable notifications so they can simply use their wiki. The more recent example is https:~/~/jira.xwiki.org/browse/XWIKI-16207.
2
3 {{toc/}}
4
5 = Problem 1: Post-Filters can cause very long loop =
6
7 To understand the problem, we need to consider the current implemenation.
8
9 1. The browser needs to know how many unread notifications there are for the current user. If there is more than 20 notifications, it does not need to know the exact number. Instead, it will display "20+".
10 1. An AJAX query is sent for this purpose.
11 1. A SQL query is generated, that take care of all enabled filters (don't show the user's event, for example) and watched pages. To avoid using too much database resource, we limit this request to the first 40 results.
12 1. For each result, some checks are performed:(((
13 * 4.1. First of all, if the event concerns a document that the user is not allowed to view, then the event is discarded. There is no way to improve the SQL query to take care of the rights so this check cannot be avoided.
14 * 4.2. Post-filters are executed. Like the right check, these filters allows to check what cannot be expressed with an SQL query.
15 * 4.3. The event is then compared to all events that we already have accepted, in order to group similar notifications inside a "fold" one that we call a CompositeEvent. The idea is to avoid having multiple notifications in the UI that concern the same document, with the same kind of event, but with different dates (like when you click "save & continue" on a document a multiple time during a work session).
16 )))
17 1. After the results have been checked and grouped into CompositeEvents, we count how many of them we have accepted. If we have less than 20 composite events, we go back to step 3 until we have at least 20 CompositeEvents, or until there is no more event in the database.
18
19 As you can see, the steps 3-4-5 can be executed a lot of times, in bad conditions. It is currently implemented as a recursive algorithm, which could theoretically lead to a stack overflow (see: https:~/~/jira.xwiki.org/browse/XWIKI-15927).
20 \\On probematic wikis, I often notice these kind of stacktraces, with a lot of repeating:
21
22 {{code}}
23 [...]
24 org.xwiki.notifications.sources.internal.DefaultParametrizedNotificationManager.getEvents(DefaultParametrizedNotificationManager.java:142)
25 org.xwiki.notifications.sources.internal.DefaultParametrizedNotificationManager.getEvents(DefaultParametrizedNotificationManager.java:142)
26 org.xwiki.notifications.sources.internal.DefaultParametrizedNotificationManager.getEvents(DefaultParametrizedNotificationManager.java:142)
27 org.xwiki.notifications.sources.internal.DefaultParametrizedNotificationManager.getEvents(DefaultParametrizedNotificationManager.java:142)
28 [...]
29 {{/code}}
30
31
32 So this is exactly what is going on. It means the SQL queries return a lot of events, but almost all of them are filtered by post-filters or are so similar that they are grouped in a few CompositeEvent.
33 \\Some scenarios I can see (in descending order or probability):
34
35 * A. There is a lot of events in documents that the user is not allowed to see. Adding a filter for the user profile on the restricted space could solve the issue.
36 * B. There is a bug in a post-filter and we need to identify which one and why.
37 * C. There is a lot of "personal messages" (using the Message Sender Gadget) that are filtered only by post-filters (I don't remember why it cannot be expressed with SQL but I had a good reason).
38 * D. The same event is stored multiple times in the database, so it continuously fill the same CompositeEvent.
39 * E. There is a bug in the recursion so the database always return the same results (but it would mean we have an infinite loop, so it would crash).
40
41 == Solution 1 - A: Add a timeout to the query ==
42
43 To avoid an infinite loop, we could simply add a timeout.
44
45 Pros:
46
47 * simple and efficient (see the commit https://github.com/xwiki/xwiki-platform/commit/bc59ddaaaa062ba3e946063c1e1b01597ad3d6b8 before it was reverted)
48
49 (((
50 Cons:
51
52 * some important notifications could be lost
53 * it does not really fix the issue. It just hides it.
54
55 (((
56 == Solution 1 - B: Create a thread pool for the notifications queries ==
57
58 With a minor priority, these queries would be executed without slowing down the wiki. However, it does not speed up the queries themselves.
59 )))
60 )))
61
62 = Problem 2: Notifications are computed each time a page is loaded =
63
64 There is absolutely no cache mechanism. So, even if the query to fetch the notifications is long, it will be re-executed the next time a user loads a page.
65
66 == Solution 2 - A: Create a memory cache ==
67
68 For each user, we could have a cache that store all the notifications that were returned during the last execution. This would be cleared:
69
70 * each time a new event is triggered
71 * after a certain period of time
72
73 (((
74 == Solution 2 - B: Create an "evolving" memory cache ==
75
76 Same than 2-A, but when a new event is triggered, the cache would not be cleared. Instead, the new event will be considered for each user, and added in the caches if the event is not filtered by the user preferences.
77
78 == [[Solution 2 - C: Store in a permanent storage the notifications for each user>>Proposal.NotificationsOptimization.NotificationStatusStorage.WebHome]] ==
79
80 This title is not exactly what I meant. The idea is not to store each notification in a table (the event is already in the event stream store), but the "read" status of each event for each user, **when the event is recorded**.
81
82 Concretely, this is how it would perform:
83
84 1. An event is triggered in XWiki.
85 1. The event is recorded in the Event Stream table.
86 1. For each user, we compute either or not the event should be displayed, according to her preferences.
87 11. If the event should be displayed for the user, create a line in the "Event Status" table for the user, with the "read" value to false.
88 11. If the event should not be displayed for the user, it is simply dismissed.
89 1. When the user fetch the notifications, there is no complex SQL query to perform anymore. We just need to look into the "Event Status" table which events are linked to the current user id.
90
91 (((
92 Pros:
93
94 * No need for SQL injections, resulting in a complex SQL query, that can be buggy.
95 * Even if the notifications are fetched every time a page is loaded, it is not very costly (the "Event Status" table acts like a cache).
96
97 (((
98 Cons:
99
100 * On a wiki with a very large number of users, it means we would need to write a lot of lines in the "Event Status" table every time an event is triggered.
101 * Since the filters need to be computed for each individual user, this process can take a lot of time, and it must be performed asynchronously.
102 * "Event Status" are generated for users who never goes to the wikis, which is a waste of resource.
103 * The "Event Status" table should probably be cleaned regularly to avoid having too much data, meaning some precious notifications could be lost while a user is absent (like a 2 weeks holidays).
104 * **This mechanism cannot be applied for the Notifications Macro (that was designed to replace the Activity Stream), so it means both mechanism should co-exists.**
105 * A thread pool is required anyway, see Solution 1B.
106 )))
107 )))
108 )))
109
110 === Questions to be answered for Solution 2 - C ===
111
112 * Which store should be used to sore the "Event Status" table? Currently, it is a table in hibernate, but SOLR could be considered, or any NoSQL solution.
113 * Does it scale?
114 ** How long does it take to perform 200K writes (this is what would happen for a wiki with 200K users when everyone is subscribed to the right events and a page is modified for ex)?
115
116 = Temporary conclusion =
117
118 My favorite solution (gdelhumeau) is **2-C** because it can fix the 2 problems. But its salability needs to be tested, and it is something that will requires a lot of development, which make it difficult to do in a short period of time (especially by me) and harder to backport in 10.11.x.
119
120 === Don't forget to apply the same strategy for emails, for which the filters can be different! ===
121
122 === ===

Get Connected