[dev] Fix kronolith-agenda script

Tue Apr 17 16:15:05 UTC 2012

Zitat von Gonçalo Queirós <goncalo.queiros at portugalmail.net>:

> Citando Jan Schneider <jan at horde.org>:
>> Zitat von Gonçalo Queirós <goncalo.queiros at portugalmail.net>:
>>> Citando Jan Schneider <jan at horde.org>:   > Zitat von Gonçalo  
>>> Queirós <goncalo.queiros at portugalmail.net>:    > On 03/21/2012  
>>> 11:40 AM, Gonçalo Queirós wrote:   > Citando Jan Schneider  
>>> <jan at horde.org>:    > Zitat von Gonçalo Queirós  
>>> <goncalo.queiros at portugalmail.net>:     > Hi there dev.     >>> We  
>>> were trying to use the kronolith-agenda script to send daily  
>>> agendas to
>>>>>>>>            everyone on our service, but the problem is that  
>>>>>>>> we currently have more
>>>>>>>>            than 350.000 shares and the script just runs out  
>>>>>>>> of memory, even with 2Gb!
>>>>>>>>
>>>>>>>>            Looking at the script more closely, we think  
>>>>>>>> there's a way to make the
>>>>>>>>            script work, regardless of the number of shares on  
>>>>>>>> the system, but we would
>>>>>>>>            like your opinion on that before coming out with a patch.
>>>>>>>>
>>>>>>>>            Current script:
>>>>>>>>            1 - Get every calendar share
>>>>>>>>            2 - For every share, list its events, to check if  
>>>>>>>> it has any event to the
>>>>>>>>            current day
>>>>>>>       This is not correct, there is no such step
>>>>>>      Sorry, was debugging on my own code ;-)    >> 3 - For the  
>>>>>> remaining list get all users that have access to the calendars   
>>>>>>     >> 4 - For every user, check if he desires to receive the  
>>>>>> daily agenda (pref)
>>>>>>>>            5 - For every user that wants to receive the daily  
>>>>>>>> agenda, get his
>>>>>>>>            calendars
>>>>>>>>            6 - Again for every calendar get the ones that  
>>>>>>>> have any event to the
>>>>>>>>            current day
>>>>>>>>            7 - Send the email if there's any calendar left
>>>>>>>>
>>>>>>>>            For our installation the current script stops on  
>>>>>>>> the first step, because
>>>>>>>>            it runs out of memory.
>>>>>>>>            We thought on creating sub-sets for the shares,  
>>>>>>>> but the problem is that we
>>>>>>>>            only know the full agenda of a user after we  
>>>>>>>> analyze all shares he has
>>>>>>>>            access to.
>>>>>>>>
>>>>>>>>            What we propose:
>>>>>>>>            1 - Get every users that desire to receive the  
>>>>>>>> daily agenda (pref)
>>>>>>>       That's exactly what steps 1, 3, and 4 do.
>>>>>>      I know, the difference is that instead of asking for all  
>>>>>> the shares, we
>>>>>>        could ask directly for all the users that wan't to  
>>>>>> receive an agenda, so
>>>>>>        we could eventually narrow the search.
>>>>>>        Also, knowing the user instead of the calendars allows  
>>>>>> us to immediately
>>>>>>        send the user his agenda, which will free the memory  
>>>>>> when the loop for
>>>>>>        that user ends.    >> 2 - execute the steps 5,6,7 of the  
>>>>>> original script      >> In the worst case scenario this script  
>>>>>> will still perform better than the
>>>>>>>>            current one, because it doesn't have the first 3 steps.
>>>>>>>>            With this approach we can create sub-sets of users  
>>>>>>>> which will allow the
>>>>>>>>            script to run until the end without running out of  
>>>>>>>> memory (even if this is
>>>>>>>>            a long process, it will execute)
>>>>>>>>
>>>>>>>>            Problems:
>>>>>>>>            We don't think there's currently a method to  
>>>>>>>> retrieve all prefs from the
>>>>>>>>            backed by its name. Maybe we need to create it,  
>>>>>>>> and state that this is for
>>>>>>>>            admin purposes only and shouldn't be called by the  
>>>>>>>> user-level code (just
>>>>>>>>            like the listAllShares method from Horde_Share_Sql).
>>>>>>>>            Currently the pref_name column is not indexed, so  
>>>>>>>> we expect slow queries.
>>>>>>>>            The fix for that is obvious.
>>>>>>>       The problem is that not all backends support listing  
>>>>>>> preference details for others than the current user. On the  
>>>>>>> other hand, those backends are already missing some features  
>>>>>>> anyway, and we already have a listScopes() method that is only  
>>>>>>> implemented in a few backends too. In the end this might be an  
>>>>>>> option.
>>>>>>>          Actually, since the same script already requires the  
>>>>>>> preference backend to return any user's preference, this would  
>>>>>>> be a safe approach.
>>>>>>>
>>>>>>>          Another approach would be do take a further look at  
>>>>>>> why listAllShares() exceeds memory. Well, there is not much to  
>>>>>>> look actually when creating 350.000+ share objects and then  
>>>>>>> attempting to sort them. Alternatively we could implement an  
>>>>>>> Iterator interface in the share drivers and only receive the  
>>>>>>> shares one by one while looping them.
>>>>>>>
>>>>>>>          Jan.
>>>>>>>
>>>>>>>          --
>>>>>>>          The Horde Project
>>>>>>> http://www.horde.org/
>>>>>>>
>>>>>>>
>>>>>>>          --
>>>>>>>          Horde developers mailing list
>>>>>>>          Frequently Asked Questions: http://horde.org/faq/To  
>>>>>>> unsubscribe, mail: dev-unsubscribe at lists.horde.org
>>>>>>      If you iterate the shares one by one i think you will end up with a
>>>>>>        similar problem, because you can't send an agenda before  
>>>>>> you have all the
>>>>>>        events from all the calendars that a user has access to.  
>>>>>> So you would
>>>>>>        probably end up again with the 350.000+ shares on memory.
>>>>>>
>>>>>>        Will try to produce a patch and submit for your appreciation.
>>>>>     Jan, attached is a first preview patch, so you can see if  
>>>>> things are going in the desired direction.
>>>>>     Returning a class that implements the ArrayAccess allows  
>>>>> backward compatibility, but unfortunately, any array_ like  
>>>>> function (array_merge, array_keys, etc) don't work, so a bit  
>>>>> more of refactor is needed.
>>>>>
>>>>>     One thing that needs to be done is move the Horde_Share_List  
>>>>> factory to an injector (if I understand the injectors  
>>>>> correctly), but for now I instantiated the objects inside the  
>>>>> Horde_Share_Sql directly.
>>>>>
>>>>>     Another problem I found was that Horde_Shares can have  
>>>>> callbacks for listings, so for now I create the new  
>>>>> Horde_Share_List object and if the callback is active, that  
>>>>> object is sent to the callback. This might brake up for users  
>>>>> that rely on array_ like functions, because as stated above they  
>>>>> wont work now.
>>>>>
>>>>>     What do you devs think?
>>>>    - You cannot implement ArrayAccess in the base class and  
>>>> Iterator in the sub-class only. Why do you implement ArrayAccess  
>>>> at all? If necessary, the Horde_Share_List consumers can use  
>>>> iterator_to_array().
>>>   If i remember correclty, you sugested the ArrayAccess for a matter of
>>>   backward compatibility.
>>  Take a look at how the share list results are consumed, i.e.  
>> whether there is array access necessary at all. This was just a  
>> suggestion to ease the transition with keeping the API as close as  
>> possible to the older version. If we only loop over the results  
>> anyway, you only need to implement the iterator.
>>  What I said about the base and child class still holds though.
> Sorry for the delay, have missed your email.
> There are to many places where positions from the array are accessed
> directly, like kronolith, imp, ingo and nag, so i think it would be good
> to stick with arrayaccess for now.
> Ok, i will make both classes implement the interfaces
>>>> - You cannot use PDO functionality in the SQL driver. You need to  
>>>> use the Horde_Db API. Horde_Db's select() already returns an  
>>>> Iterator that you can proxy in Horde_Share_List.
>>>   The Horde_Db api states that it returns an iterator, but it actually
>>>   returns a PDO statement, that's why i need to use PDO functionality.
>>  It depends on the driver what's being returned, and you must use  
>> select(), not execute(). The point is, whatever is returned by  
>> select(), even PDO statements, *are* iterators. They implement  
>> Traversable to be more specific.
> Yes, i will use the select instead of execute.
> The PDO statement is an iterator, you are correct, the problem is that it
> returns an array of fields, instead of the share object, and since the
> code beeing replaced returns an array of shares, i think its desirable to
> maintain this behavior.

I did not suggest to directly use the Horde_Db iterator. My suggestion  
was to make Horde_Share_List an iterator *wrapping* or *proxying* the  
db iterator. This would for example look like (pseudo code):

class Horde_Share_List implements Iterator
{
     public function current()
     {
         return $this->_driver->createObject(current($this->_listResult));
     }
}

>>>> - You cannot use the injector or any other globals in library code.
>>>   Ok. >   > - You only need share_id in the SQL driver's SELECT query.
>>>   I also need the name (because of the key), but i understand your  
>>> point.   > - Using the iterator obviously scales better  
>>> memory-wise, but did you test how it scales performance-wise. IIUC  
>>> you are now reading each share individually from the backend which  
>>> could be a performance degration.
>>>   I haven't tested this performance-wise, but i don't doubt it will run
>>>   slower than the current approach. The difference is that the current
>>>   approach will stop working when you get to a certain amount of shares. I
>>>   will test this a bit further so we can have a better knowledge  
>>> of the real
>>>   impact in performance as soon as i have a stable patch ;-)
>>  This will be a key point. I rather accept that the agenda script  
>> simply doesn't work with a large number of shares, than losing any  
>> speed. The sqlng share driver is highly optimized and tuned for  
>> high performance even with a huge number of shares.
>>  If we cannot have both, we can still implement a separate method  
>> that's only used by the agenda script or any other cli script that  
>> can run longer.
>  From the tests i produced, the iterator is always slower:
> 5 times slower when getting 50 shares
> 9 times slower when getting 1000 and 10k shares
> 5 times slower when getting 100k shares

Okay, that's not acceptable.

> I also noticed that the iterator was consuming a lot of memory, because
> Horde_Share_Base keeps a local cache of the shares retrieved.
> So iterating over 100k shares was consuming 675Mb. I found some solutions
> to this:
>
> - Pass an extra parameter to Horde_Share_List that will make a call to
> Horde_Share_Base->clearCache everytime  a new share is retrieving (thus
> invalidating the cache)

Bad idea.

> - Only introduce the Horde_Share_List code to the methods that are
> supposed to be used by admin and cli scripts, and have the
> Horde_Share_List invalidating the Horde_Share_Base cache everytime a new
> share is retrieved.

This sounds like the only viable solution given the performance loss.  
You have to verify this, but I guess we only use listAllShares() in  
cli/admin scripts, and listAllShares() is only used in cli/admin  
scripts.
If that's true, you can only modify that method to return the  
iterator. If not, you need to add a new method for that purpose.

-- 
Jan Schneider
The Horde Project
http://www.horde.org/