I've just had an e-mail from the systems administrator at our hosting company. He has had to restart his web server cluster 4 times in the last 3 days because some of our SPIP 2.0 sites are encountering apparent deadlocks and hanging PHP FastCGI processes.
If I can't resolve this problem, I'm going to have to move some of our 90+ SPIP-based web-sites onto alternative hosting. This is something that neither I nor our hosting company want to do, but they have other customers and better things to do than restart their web servers every 18 hours.
I intend to instrument the SPIP locking functions to build a complete record of SPIP's locking behaviour across a cluster of web servers. Has anyone done something like this already? I am planning to log all locking calls to a remote syslog server, but if someone has already done something similar, I would quite like to avoid the extra work.
Any suggestions about identifying or resolving this problem would be extremely welcome.
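For illustration, the instrumentation described above (a record of every locking call, tagged by server and process) might look roughly like the sketch below. It is written in Python rather than PHP to keep it self-contained, and it assumes the locks are flock()-style advisory locks. The names `make_lock_logger` and `traced_lock` are hypothetical and are not part of SPIP.

```python
import fcntl
import logging
import os
import socket
import time
from contextlib import contextmanager


def make_lock_logger(handler):
    """Return a logger that sends lock events to the given handler."""
    logger = logging.getLogger("lock_trace")
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]   # replace any handlers from an earlier call
    logger.propagate = False
    return logger


@contextmanager
def traced_lock(path, logger):
    """Take an exclusive flock() on path and log wait, acquire and release."""
    # Host name and PID let events from several web servers be told apart.
    who = "%s[%d]" % (socket.gethostname(), os.getpid())
    with open(path, "a") as fh:
        logger.info("%s WAIT %s", who, path)
        started = time.time()
        fcntl.flock(fh, fcntl.LOCK_EX)   # blocks until the lock is granted
        logger.info("%s ACQUIRED %s after %.3fs", who, path, time.time() - started)
        try:
            yield fh
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
            logger.info("%s RELEASED %s", who, path)
```

To send the events to a remote syslog server, the handler could be `logging.handlers.SysLogHandler(address=("loghost.example", 514))`, where the host name is a placeholder. A WAIT line with no matching ACQUIRED line would then point at the process and file involved in a hang.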
I intend to instrument the SPIP locking functions to build a complete record
you could also try and remove all locking functions. The most
important use we have is to ensure that we don't read a file (with
lire_fichier()) that is not finished writing (with ecrire_fichier()),
and this could for sure be replaced by a "write to tempnam() then
mv()" behavior. It might also be faster.
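A minimal sketch of that "write to a temporary file, then rename" idea, in Python for the sake of a self-contained example (in PHP the corresponding calls would be tempnam() and rename()). The function name `write_atomically` is hypothetical. It assumes a POSIX filesystem where rename within one directory is atomic.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write bytes to path so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the final rename is not a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the data is on disk before renaming
        os.replace(tmp_path, path)   # swap the finished file into place
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern the reader needs no lock at all: it opens either the previous complete file or the new complete file.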
another use we have (I think) is to ensure that we don't launch two
cron jobs in parallel -- this is far from being as frequent as the
atomic file writing stuff, and might also be replaced by some other
mechanism
Still, you are the only one who has reported this problem (maybe others
experience it, but do not realize what is happening).
I intend to instrument the SPIP locking functions to build a complete record
you could also try and remove all locking functions. The most
important use we have is to ensure that we don't read a file (with
lire_fichier()) that is not finished writing (with ecrire_fichier()),
and this could for sure be replaced by a "write to tempnam() then
mv()" behavior. It might also be faster.
Yeah, it's an alternative.
Since it does not work on Windows, it can't be used as the native lock mechanism, but it could be offered as an alternative, like the nfslock in the core.
another use we have (I think) is to ensure that we don't launch two
cron jobs in parallel -- this is far from being as frequent as the
atomic file writing stuff, and might also be replaced by some other
mechanism
With the job_queue plugin, cron jobs are managed by the queue, and the protection against parallel jobs is based on SQL rather than on the file system.
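To illustrate the general idea of SQL-based protection against parallel jobs, here is a rough sketch in Python with SQLite. This is not the job_queue plugin's actual code. The `jobs` table, its columns and the function names are assumptions made up for the example. It relies on a single conditional UPDATE, so only one worker can claim a given job.

```python
import sqlite3


def claim_job(conn, job_name, worker_id):
    """Try to claim job_name for worker_id; return True only for the single winner."""
    cur = conn.execute(
        # The WHERE clause matches only while nobody owns the job,
        # so a second worker's UPDATE changes zero rows.
        "UPDATE jobs SET owner = ? WHERE name = ? AND owner IS NULL",
        (worker_id, job_name),
    )
    conn.commit()
    return cur.rowcount == 1


def release_job(conn, job_name, worker_id):
    """Give the job back, but only if worker_id is the current owner."""
    conn.execute(
        "UPDATE jobs SET owner = NULL WHERE name = ? AND owner = ?",
        (job_name, worker_id),
    )
    conn.commit()
```

A worker would call `claim_job` before running the cron task and skip the task when it returns False. Because the check happens in the database, it does not depend on file locks behaving consistently across the web server cluster.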