[dev] Fix kronolith-agenda script

Tue Apr 17 15:34:44 UTC 2012

Citando Jan Schneider <jan at horde.org>:
> Zitat von Gonçalo Queirós <goncalo.queiros at portugalmail.net>:
>> Citando Jan Schneider <jan at horde.org>:   > Zitat von Gonçalo  
>> Queirós <goncalo.queiros at portugalmail.net>:    > On 03/21/2012  
>> 11:40 AM, Gonçalo Queirós wrote:   > Citando Jan Schneider  
>> <jan at horde.org>:    > Zitat von Gonçalo Queirós  
>> <goncalo.queiros at portugalmail.net>:     > Hi there dev.     >>> We  
>> were trying to use the kronolith-agenda script to send daily  
>> agendas to
>>>>>>>             everyone on our service, but the problem is that  
>>>>>>> we currently have more
>>>>>>>             than 350.000 shares and the script just runs out  
>>>>>>> of memory, even with 2Gb!
>>>>>>>
>>>>>>>             Looking at the script more closely, we think  
>>>>>>> there's a way to make the
>>>>>>>             script work, regardless of the number of shares on  
>>>>>>> the system, but we would
>>>>>>>             like your opinion on that before coming out with a patch.
>>>>>>>
>>>>>>>             Current script:
>>>>>>>             1 - Get every calendar share
>>>>>>>             2 - For every share, list its events, to check if  
>>>>>>> it has any event to the
>>>>>>>             current day
>>>>>>        This is not correct, there is no such step
>>>>>       Sorry, was debugging on my own code ;-)    >> 3 - For the  
>>>>> remaining list get all users that have access to the calendars    
>>>>>    >> 4 - For every user, check if he desires to receive the  
>>>>> daily agenda (pref)
>>>>>>>             5 - For every user that wants to receive the daily  
>>>>>>> agenda, get his
>>>>>>>             calendars
>>>>>>>             6 - Again for every calendar get the ones that  
>>>>>>> have any event to the
>>>>>>>             current day
>>>>>>>             7 - Send the email if there's any calendar left
>>>>>>>
>>>>>>>             For our installation the current script stops on  
>>>>>>> the first step, because
>>>>>>>             it runs out of memory.
>>>>>>>             We thought on creating sub-sets for the shares,  
>>>>>>> but the problem is that we
>>>>>>>             only know the full agenda of a user after we  
>>>>>>> analyze all shares he has
>>>>>>>             access to.
>>>>>>>
>>>>>>>             What we propose:
>>>>>>>             1 - Get every users that desire to receive the  
>>>>>>> daily agenda (pref)
>>>>>>        That's exactly what steps 1, 3, and 4 do.
>>>>>       I know, the difference is that instead of asking for all  
>>>>> the shares, we
>>>>>         could ask directly for all the users that wan't to  
>>>>> receive an agenda, so
>>>>>         we could eventually narrow the search.
>>>>>         Also, knowing the user instead of the calendars allows  
>>>>> us to immediately
>>>>>         send the user his agenda, which will free the memory  
>>>>> when the loop for
>>>>>         that user ends.    >> 2 - execute the steps 5,6,7 of the  
>>>>> original script      >> In the worst case scenario this script  
>>>>> will still perform better than the
>>>>>>>             current one, because it doesn't have the first 3 steps.
>>>>>>>             With this approach we can create sub-sets of users  
>>>>>>> which will allow the
>>>>>>>             script to run until the end without running out of  
>>>>>>> memory (even if this is
>>>>>>>             a long process, it will execute)
>>>>>>>
>>>>>>>             Problems:
>>>>>>>             We don't think there's currently a method to  
>>>>>>> retrieve all prefs from the
>>>>>>>             backed by its name. Maybe we need to create it,  
>>>>>>> and state that this is for
>>>>>>>             admin purposes only and shouldn't be called by the  
>>>>>>> user-level code (just
>>>>>>>             like the listAllShares method from Horde_Share_Sql).
>>>>>>>             Currently the pref_name column is not indexed, so  
>>>>>>> we expect slow queries.
>>>>>>>             The fix for that is obvious.
>>>>>>        The problem is that not all backends support listing  
>>>>>> preference details for others than the current user. On the  
>>>>>> other hand, those backends are already missing some features  
>>>>>> anyway, and we already have a listScopes() method that is only  
>>>>>> implemented in a few backends too. In the end this might be an  
>>>>>> option.
>>>>>>           Actually, since the same script already requires the  
>>>>>> preference backend to return any user's preference, this would  
>>>>>> be a safe approach.
>>>>>>
>>>>>>           Another approach would be do take a further look at  
>>>>>> why listAllShares() exceeds memory. Well, there is not much to  
>>>>>> look actually when creating 350.000+ share objects and then  
>>>>>> attempting to sort them. Alternatively we could implement an  
>>>>>> Iterator interface in the share drivers and only receive the  
>>>>>> shares one by one while looping them.
>>>>>>
>>>>>>           Jan.
>>>>>>
>>>>>>           --
>>>>>>           The Horde Project
>>>>>> http://www.horde.org/
>>>>>>
>>>>>>
>>>>>>           --
>>>>>>           Horde developers mailing list
>>>>>>           Frequently Asked Questions: http://horde.org/faq/To  
>>>>>> unsubscribe, mail: dev-unsubscribe at lists.horde.org
>>>>>       If you iterate the shares one by one i think you will end up with a
>>>>>         similar problem, because you can't send an agenda before  
>>>>> you have all the
>>>>>         events from all the calendars that a user has access to.  
>>>>> So you would
>>>>>         probably end up again with the 350.000+ shares on memory.
>>>>>
>>>>>         Will try to produce a patch and submit for your appreciation.
>>>>      Jan, attached is a first preview patch, so you can see if  
>>>> things are going in the desired direction.
>>>>      Returning a class that implements the ArrayAccess allows  
>>>> backward compatibility, but unfortunately, any array_ like  
>>>> function (array_merge, array_keys, etc) don't work, so a bit more  
>>>> of refactor is needed.
>>>>
>>>>      One thing that needs to be done is move the Horde_Share_List  
>>>> factory to an injector (if I understand the injectors correctly),  
>>>> but for now I instantiated the objects inside the Horde_Share_Sql  
>>>> directly.
>>>>
>>>>      Another problem I found was that Horde_Shares can have  
>>>> callbacks for listings, so for now I create the new  
>>>> Horde_Share_List object and if the callback is active, that  
>>>> object is sent to the callback. This might brake up for users  
>>>> that rely on array_ like functions, because as stated above they  
>>>> wont work now.
>>>>
>>>>      What do you devs think?
>>>     - You cannot implement ArrayAccess in the base class and  
>>> Iterator in the sub-class only. Why do you implement ArrayAccess  
>>> at all? If necessary, the Horde_Share_List consumers can use  
>>> iterator_to_array().
>>    If i remember correclty, you sugested the ArrayAccess for a matter of
>>    backward compatibility.
>   Take a look at how the share list results are consumed, i.e.  
> whether there is array access necessary at all. This was just a  
> suggestion to ease the transition with keeping the API as close as  
> possible to the older version. If we only loop over the results  
> anyway, you only need to implement the iterator.
>   What I said about the base and child class still holds though.
Sorry for the delay, have missed your email.
There are to many places where positions from the array are accessed
directly, like kronolith, imp, ingo and nag, so i think it would be good
to stick with arrayaccess for now.
Ok, i will make both classes implement the interfaces
>>> - You cannot use PDO functionality in the SQL driver. You need to  
>>> use the Horde_Db API. Horde_Db's select() already returns an  
>>> Iterator that you can proxy in Horde_Share_List.
>>    The Horde_Db api states that it returns an iterator, but it actually
>>    returns a PDO statement, that's why i need to use PDO functionality.
>   It depends on the driver what's being returned, and you must use  
> select(), not execute(). The point is, whatever is returned by  
> select(), even PDO statements, *are* iterators. They implement  
> Traversable to be more specific.
Yes, i will use the select instead of execute.
The PDO statement is an iterator, you are correct, the problem is that it
returns an array of fields, instead of the share object, and since the
code beeing replaced returns an array of shares, i think its desirable to
maintain this behavior.
>>> - You cannot use the injector or any other globals in library code.
>>    Ok. >   > - You only need share_id in the SQL driver's SELECT query.
>>    I also need the name (because of the key), but i understand your  
>> point.   > - Using the iterator obviously scales better  
>> memory-wise, but did you test how it scales performance-wise. IIUC  
>> you are now reading each share individually from the backend which  
>> could be a performance degration.
>>    I haven't tested this performance-wise, but i don't doubt it will run
>>    slower than the current approach. The difference is that the current
>>    approach will stop working when you get to a certain amount of shares. I
>>    will test this a bit further so we can have a better knowledge  
>> of the real
>>    impact in performance as soon as i have a stable patch ;-)
>   This will be a key point. I rather accept that the agenda script  
> simply doesn't work with a large number of shares, than losing any  
> speed. The sqlng share driver is highly optimized and tuned for high  
> performance even with a huge number of shares.
>   If we cannot have both, we can still implement a separate method  
> that's only used by the agenda script or any other cli script that  
> can run longer.
  From the tests i produced, the iterator is always slower:
5 times slower when getting 50 shares
9 times slower when getting 1000 and 10k shares
5 times slower when getting 100k shares

I also noticed that the iterator was consuming a lot of memory, because
Horde_Share_Base keeps a local cache of the shares retrieved.
So iterating over 100k shares was consuming 675Mb. I found some solutions
to this:

- Pass an extra parameter to Horde_Share_List that will make a call to
Horde_Share_Base->clearCache everytime  a new share is retrieving (thus
invalidating the cache)
- Only introduce the Horde_Share_List code to the methods that are
supposed to be used by admin and cli scripts, and have the
Horde_Share_List invalidating the Horde_Share_Base cache everytime a new
share is retrieved.

What you think?
Best regards,

Gonçalo Queirós