Help using XML data feeds

Hi all,

I'm working on a site which needs to extract data from a few XML data feeds (in a schema unrelated to RSS and Atom) and aggregate it for display. At the moment I'm using a model which downloads and processes the XML data 'by hand' in PHP, but this has the major drawback of causing a noticeable pause on the client end.

I'd like to take a better approach by hooking into SPIP's cron and cache functionality to download my XML and process it without delaying the client. Can anyone point me to English documentation or an example that would help me avoid writing my own mechanisms to cache data and avoid client pauses?

Cheers,

Thomas Sutton

Maybe the best way is to do your PHP processing in a standard template,
by using custom filters.

I would do something like this:
----
file : /squelettes/extract_data.html
#CACHE{3600}
[(#ENV{data}|extract_data{xml}|other_action)]
----
file: /squelettes/extract_data_fonctions.php
<?php

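function extract_data($txt,$format) {
   // placeholder: parse $txt according to $format and build the output
   $txt_changed = do_what_you_need_to_do();
   return $txt_changed;
}

function other_action($txt){
  // placeholder: any side effect, e.g. storing the result in the database
  anything_you_want_like_store_in_db_or_else();
  return ''; // to display nothing
}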

?>
-----
The result of the extraction will be stored in the cache for an hour. (Note
that extract_data_fonctions.php is only loaded when extract_data.html is
recalculated and stored in the cache. I think that's better than
putting all this specific stuff inside mes_fonctions.php.)
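
As a rough sketch of what a concrete extract_data() filter could look like, here is a version based on PHP's SimpleXML extension (so it assumes PHP 5 or later). The <item>, <title> and <value> element names are a made-up schema for illustration, not the one used by the real feeds:

```php
<?php
// Sketch of a concrete extract_data() filter. The <item>/<title>/<value>
// element names are an invented schema, not the one from the real feeds.
function extract_data($txt, $format) {
    if ($format !== 'xml') {
        return ''; // only XML is handled in this sketch
    }
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($txt);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return ''; // malformed feed: display nothing
    }
    $lines = array();
    foreach ($xml->item as $item) {
        $lines[] = '<li>' . htmlspecialchars((string) $item->title)
                 . ': ' . htmlspecialchars((string) $item->value) . '</li>';
    }
    return $lines ? '<ul>' . implode('', $lines) . '</ul>' : '';
}
```

Returning an empty string on bad input keeps a broken feed from breaking the cached page; whether that is the right behaviour depends on the site.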

You don't have to hook into #CRON if you just want your processing to be
triggered on every call: #CRON simply includes an image whose URL is in
fact a SPIP action. Just add to your template an iframe that points to
spip.php?page=extract_data

Of course the content of extract_data.html can be included inside
other templates, like jquery.js.html :slight_smile:

Another solution is to use a static include:
[(#INCLUDE{page=extract_data}{env})]
instead of <INCLUDE>
and insert this piece of code into a template with #CACHE.
Then you don't have to use filters, and the included template can
mix HTML and PHP processing: it won't be recalculated until the
main page is refreshed. #INCLUDE returns a string, while <INCLUDE>
inserts all the included PHP code inside the cache of the main file.

Is this what you are looking for?

.Gilles
---

_______________________________________________
spip-en@rezo.net - http://listes.rezo.net/mailman/listinfo/spip-en

Hello Thomas,

Gilles' approach is the simplest to put in place and, depending on the cache duration, shouldn't be too hard on your server and visitors. But as you say, the visitor who triggers the cache refresh will have to wait a long time.

I think Gilles is misleading (involuntarily, I am sure :wink:) when he says the cron approach will also slow down your visitors.

The cron in SPIP doesn't rely on the Unix cron table (which might not be accessible to the user), but on its own implementation.
What SPIP does is insert a small image in each web page (or only in the ones with the #CRON tag); loading that image calls a PHP page that runs the cron tasks and returns a transparent, pixel-sized image.

This way, the page load is not slowed down, as modern browsers usually load images in the background. And the cron tasks can be spread across multiple visits.
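
To make the mechanism concrete, here is a minimal standalone sketch of the idea: a handler that runs at most one pending task per image request and then answers with a 1x1 transparent GIF. This is not SPIP's actual code; the function names and the "one task per hit" policy are assumptions made for the illustration.

```php
<?php
// Illustration only: NOT SPIP's actual code, just the general idea of
// "do some background work, then answer with a tiny transparent image".

// A commonly used 1x1 transparent GIF, base64-encoded.
function demo_transparent_gif() {
    return base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
}

// Hypothetical handler for the image URL: runs at most one pending task
// per hit, so the work is spread over several visits.
function demo_cron_hit(array &$pending_tasks) {
    if (count($pending_tasks) > 0) {
        $task = array_shift($pending_tasks);
        call_user_func($task);
    }
    return demo_transparent_gif();
}

// A real script would then send the image to the browser:
// header('Content-Type: image/gif');
// echo demo_cron_hit($pending_tasks);
```

The visitor's page never waits for the task itself: only the image request does.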

Inserting your own task into the SPIP cron task list is not that difficult, but you will have to create a plugin to do that.
This is not really documented anywhere, not even in French as far as I know...

In the latest versions of SPIP, plugins can integrate with SPIP through "pipelines". The system is simple:
-* the plugin says it is interested in a particular pipeline
-* when the PHP code containing the pipeline is executed, SPIP calls all the plugins that declared themselves and expects a particular result from them.

There is a pipeline for inserting a new task into the crons (1.9.2 has it; for older versions, I am not sure). It's called taches_generales_cron.

So you will need to create a plugin that registers with that pipeline. To create the plugin, take this template:

Edit the plugin.xml to specify the data about your plugin and remove all the <pipeline> tags except the one for taches_generales_cron.

You won't need the <bouton>, <fonctions> and <options> parts, I think.
The <prefix> tag specifies a unique ID for your plugin, which will be the prefix of all your files and functions to avoid clashes. Set it to whatever you want.

Now, copy template_pipelines_cron.php and keep the function template_taches_generales_cron($flux)
(replace 'template' with your prefix).

This is the pipeline callback that will insert your new function into the cron. You have to modify $flux to insert your task, with something like:
function yourprefix_taches_generales_cron($flux) {
   $flux['yourprefix_cron'] = 3600*24; // delay between two runs, in seconds
   return $flux;
}

$flux is an array whose keys are the handles of the cron functions (the ones executing the tasks) and whose values are delays in seconds; here we say the cron has to be executed every 24 hours.
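
To see what the callback does to the array, here is a small standalone sketch. The two starting entries and their delays are invented for the example, and the direct call at the bottom only mimics what SPIP does when it runs the pipeline:

```php
<?php
// Standalone illustration: the starting entries and their delays are
// invented, and the direct call below only mimics SPIP running the pipeline.

function yourprefix_taches_generales_cron($flux) {
    $flux['yourprefix_cron'] = 3600 * 24; // run once every 24 hours
    return $flux;
}

// Hypothetical task list before your plugin is consulted.
$flux = array(
    'syndic'     => 90,
    'indexation' => 90,
);

$flux = yourprefix_taches_generales_cron($flux);

// Print every registered task with its delay.
foreach ($flux as $task => $delay) {
    echo $task . ' => every ' . $delay . " seconds\n";
}
```

The important point is that the callback adds its own key and returns the whole array, so the entries registered by SPIP and by other plugins are preserved.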

Now, for the implementation of the callback handle.
When executing crons, SPIP will look for a file named yourprefix_cron.php to include. It should be in the SPIP include path; the right way to do that in your plugin is to create a file inc/yourprefix_cron.php under your plugin directory.

After loading that file, SPIP will call the function cron_yourprefix_cron($t). You should implement it in your file, and it should execute your task.

I am not sure what the $t parameter stands for; I think it's the time of the last execution of your task.
Your function should also return an int code, but I am not sure what it's used for. I found that returning 1 is sufficient if your task is not too long.
From what I understand of the code (ecrire/inc/cron.php), if it is negative, it means that the task is not finished yet and should be called back for the current time frame (for example, if you have long computations to do, multiple feeds to process, etc.).
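
Putting this together, a hedged sketch of inc/yourprefix_cron.php might look like the following. Only the function name cron_yourprefix_cron($t) and the meaning of the return code come from the description above; the feed list, the helper function, the limit of two feeds per run and the exact negative value (-1) are assumptions for the illustration.

```php
<?php
// Sketch of inc/yourprefix_cron.php. Only the function name
// cron_yourprefix_cron($t) and the meaning of the return code come from
// the description above; everything else is a hypothetical placeholder.

// Hypothetical list of feeds still waiting to be refreshed. A real plugin
// would keep this state in the database, because PHP globals do not
// survive from one request to the next.
$GLOBALS['yourprefix_pending'] = array('feed-a', 'feed-b', 'feed-c');

// Hypothetical placeholder: download, parse and store a single feed.
function yourprefix_process_feed($feed) {
    // ... fetch the XML, extract the data, write it to your table ...
    return true;
}

function cron_yourprefix_cron($t) {
    $max_per_run = 2; // assumption: keep each run short
    for ($i = 0; $i < $max_per_run && count($GLOBALS['yourprefix_pending']) > 0; $i++) {
        $feed = array_shift($GLOBALS['yourprefix_pending']);
        yourprefix_process_feed($feed);
    }
    if (count($GLOBALS['yourprefix_pending']) > 0) {
        return -1; // negative: not finished, ask to be called back
    }
    return 1; // everything processed
}
```

With three pending feeds and a limit of two per run, the first call returns a negative code and the second one returns 1.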

You can see some examples in the files:
ecrire/inc/syndic.php
ecrire/inc/indexation.php
ecrire/inc/rubriques.php
...
(look for the cron_ functions in ecrire/inc).

In your case, the cron task will execute your feed processing function. I suppose the best approach is to put the processed data in a database table and then read it back with your own SPIP loop in your templates. Declaring a loop on your own data table is not too complex, and plenty of examples and information exist for that.

I hope this helps,

Pierre

Wow! This is the beginning of a documentation, Pierre: you will
certainly be quoted on spip-contrib in the "carnets de notes" :wink:

As for #CRON, I didn't mean to say that it can slow down the page load.
The code used for it is the best choice:
<div style="background-image:
url('http://www.spip.net/spip.php?action=cron');"></div>

So the JavaScript onload event doesn't wait until this image is completely loaded.
(For people who don't know the difference between window.onload()
and jQuery(document).ready(): with the first one, the JavaScript
action is triggered only when everything has been loaded (images,
Flash, etc.), whereas the second one triggers its function as soon as
the HTML code has been downloaded.)

Sorry, this is getting a little off topic.

.Gilles
---

Gilles Vincent wrote:

> Wow! This is the beginning of a documentation, Pierre: you will
> certainly be quoted on spip-contrib in the "carnets de notes" :wink:

Well, I did send that to doc.spip.org too, but I still can't understand the editorial process of that site...

I could also use the wiki in fact, but I suppose this would be best in the official doc... we'll see.

> So the JavaScript onload event doesn't wait until this image is completely loaded.

You are right, I didn't think of the JavaScript case.

Pierre

Pierre and Gilles,

Thanks to both of you for your replies. I'll probably go with the plugin cron and loop version if only to avoid the pause for one client every few minutes (the client wants as close to "live" as they can get -- which is every 3--5 minutes) and doing database work in a template (which offends my sense of purity).

Again, many thanks for your replies,

Thomas Sutton
