Sunday, October 14, 2012

Download/backup BibTeX library from CiteULike, including all the attached PDF

CiteULike is a convenient solution for bibliographic management, but sometimes an off-line replica of the bibliography is also needed. This can be achieved by exporting the CiteULike library in BibTeX format, furthermore BibDesk (and probably other managers too) can be synchronized directly with the CiteULike server.

While it is easy to export a .bib file from CiteULike, downloading all the attached files and having the .bib linking to the downloaded files is a different story.

BibTeX and JSON exports of the CiteULike library can be retrieved from:
  http://www.citeulike.org/json/user/USERNAME
  http://citeulike.org/bibtex/user/USERNAME
The BibTeX export can be read directly into a bibliographic manager, but the JSON export contains more information than the .bib, in particular it contains the location of the attached PDFs. So the idea of the Python script linked below is to do the following
  1. download the CiteULike library in BibTeX and JSON formats
  2. parse the JSON export and download all the attachments
  3. modify the .bib file to include links to the downloaded copies of the attachments
the links in the .bib file should work for BibDesk and JabRef.

Before running the script:
  • setup CITEULIKE_USERNAME and CITEULIKE_PASSWORD variables in the script
  • verify you have wget and pybtex installed
Download the Python script:  citeulike_backup.zip

http://wiki.citeulike.org/index.php/Importing_and_Exporting#JSON


Gory details from http://wiki.citeulike.org/index.php/Importing_and_Exporting#JSON:
# save session cookies
> wget -O /dev/null --keep-session-cookies  --save-cookies cookies.txt --post-data="username=xxxx&password=yyyy&perm=1" http://www.citeulike.org/login.do
# download bibtex with private comments and download an attachment
> wget -O export.bib --load-cookies cookies.txt http://www.citeulike.org/bibtex/user/xxxx
> wget --load-cookies cookies.txt http://www.citeulike.org//pdf/user/xxxx/article/123456/891011/some_99_paper_123456.pdf

2 comments:

  1. It seems that with version 1.13.4 of wget CiteULike may block access, resulting in the wget request to fail with a 403 forbidden message.

    This can be fixed by adding this option to the wget call:

    --header='User-Agent: yourname@gmail.com'


    Source: http://www.citeulike.org/groupforum/2781

    ReplyDelete
  2. I've just updated the script (now v1.2): incorporating Carlo's fix, and removing escape sequences from the DOI/url fields.

    ReplyDelete