miner module

miner.search(ids=None, member=None, filter=None, limit=500, **kwargs)

Search Crossref to get text mining links

Parameters:
  • ids – [list] DOIs (digital object identifiers) or other identifiers
  • member – [str] Crossref member IDs
  • filter – [dict] Filter options. See ...
  • limit – [int] Number of results to return. Not relevant when searching with specific DOIs. Default: 500 (as in the signature above). Max: 1000
  • kwargs – any additional arguments will be passed on to requests.get
Returns:

A dictionary of results

Usage:

from pyminer import miner
miner.search(filter = {'has_full_text': True}, limit = 5)
miner.search(filter = {'full_text_type': 'text/plain', 'license_url': "http://creativecommons.org/licenses/by-nc-nd/3.0"})
miner.search(filter = {'has_full_text': True, 'license_url': "http://creativecommons.org/licenses/by/4.0"})
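
The filter dictionary in the calls above presumably ends up as the Crossref REST API's filter query parameter, which is a comma-separated list of name:value pairs such as has-full-text:true. The sketch below shows one way that translation could work. build_filter_param and the DOTTED table are hypothetical and not part of pyminer, and the mapping from Python-style keys to Crossref filter names is an assumption that covers only the filters used in this section.

```python
# Crossref filter names that contain a dot; only those used in this section are listed.
DOTTED = {"full_text_type": "full-text.type", "license_url": "license.url"}


def build_filter_param(filter):
    """Turn a filter dict into a Crossref-style filter string (hypothetical helper)."""
    parts = []
    for name, value in filter.items():
        # Assumption: remaining names map to Crossref names by swapping "_" for "-".
        key = DOTTED.get(name, name.replace("_", "-"))
        # Crossref expects lowercase booleans; other values are passed through as text.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


params = {
    "filter": build_filter_param({"has_full_text": True, "full_text_type": "text/plain"}),
    "rows": 5,  # Crossref's query parameter for the number of results
}
print(params["filter"])
```

A dictionary like params could then be handed to an HTTP client as query parameters; whether pyminer builds its request this way is not confirmed here.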

miner.fetch(url)

Get full text

Works easily for open access papers, but not for closed ones. For non-OA papers, use Crossref’s Text and Data Mining (TDM) service, which requires authentication and a pre-authorized IP address. Go to https://apps.crossref.org/clickthrough/researchers to sign up for the TDM service and get your key. The only publishers taking part at this time are Elsevier and Wiley.
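
At the HTTP level, the Crossref click-through service had clients send the key in a CR-Clickthrough-Client-Token request header. The sketch below builds such a request with the standard library and does not send it. tdm_request, the example.org URL, and the placeholder key are hypothetical, and how pyminer itself accepts the key is not described in this section.

```python
from urllib.request import Request


def tdm_request(url, token):
    """Build (but do not send) a GET request carrying a TDM key (hypothetical helper)."""
    return Request(url, headers={"CR-Clickthrough-Client-Token": token})


req = tdm_request("https://example.org/fulltext.xml", "your-tdm-key")
# urllib stores header names via str.capitalize(), hence the lowercase letters here.
print(req.get_header("Cr-clickthrough-client-token"))
```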

Parameters:
  • url – [str] A URL for full text
Returns:

[Mined] An object of class Mined, which exposes the requested URL, the file path, and the content type, and which can parse plain text or XML, or extract text from a PDF.

Parsing XML returns an object of class lxml.etree._Element, which you can navigate using, for example, lxml.
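
As a rough illustration of navigating a parsed XML element, the sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml, since find, iter, and itertext behave the same way for this purpose. The XML snippet is made up for the example and only loosely imitates a journal article.

```python
import xml.etree.ElementTree as ET

# Invented sample document; a real article would be far larger.
sample = """<article>
  <front><article-title>An example title</article-title></front>
  <body><p>First paragraph.</p><p>Second <italic>styled</italic> paragraph.</p></body>
</article>"""

root = ET.fromstring(sample)

# Look up a single element by path and read its text.
title = root.find("./front/article-title").text

# Collect every paragraph, flattening inline markup such as <italic>.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]

print(title)
print(paragraphs)
```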

Usage:

from pyminer import miner

# pdf
url = "http://www.banglajol.info/index.php/AJMBR/article/viewFile/25509/17126"
out = miner.fetch(url)
out.url
out.path
out.type
out.parse()

# xml
url = "https://peerj.com/articles/cs-23.xml"
out = miner.fetch(url)
out.url
out.path
out.type
out.parse()
## or drop down to individual parsing methods
from pyminer import parsers as p
p.parse_xml(out.path)
p.parse_xml_string(out.path)

# search first, then pass links to fetch
res = miner.search()
miner.fetch(res['url'])
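
The snippet above assumes the search result exposes a URL directly. In the Crossref REST API's JSON, full-text links sit in a link list on each work, and each entry carries URL, content-type, and intended-application fields. The sketch below pulls text-mining URLs out of such a response. text_mining_links is a hypothetical helper and the sample response is invented for illustration, so the shape of what miner.search actually returns may differ.

```python
def text_mining_links(response, content_type=None):
    """Collect text-mining URLs from a Crossref-style response dict (hypothetical helper)."""
    urls = []
    for item in response.get("message", {}).get("items", []):
        for link in item.get("link", []):
            # Skip links not intended for text mining.
            if link.get("intended-application") != "text-mining":
                continue
            # Optionally keep only one content type, e.g. "application/pdf".
            if content_type and link.get("content-type") != content_type:
                continue
            urls.append(link["URL"])
    return urls


# Invented response with the fields described above.
sample_response = {
    "message": {
        "items": [
            {"DOI": "10.0000/example.1", "link": [
                {"URL": "https://example.org/a.pdf", "content-type": "application/pdf",
                 "intended-application": "text-mining"},
                {"URL": "https://example.org/a.xml", "content-type": "application/xml",
                 "intended-application": "text-mining"},
            ]},
            {"DOI": "10.0000/example.2"},  # a work with no full-text links
        ]
    }
}

pdf_urls = text_mining_links(sample_response, content_type="application/pdf")
print(pdf_urls)
```

Each URL in pdf_urls could then be passed to miner.fetch one at a time.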

miner.extract(path)

Extract full text from PDFs

Parameters:
  • path – [str] Path to a PDF file downloaded via miner.fetch, or obtained another way.
Returns:

[str] A string of text

Usage:

from pyminer import miner

# a pdf
url = "http://www.banglajol.info/index.php/AJMBR/article/viewFile/25509/17126"
out = miner.fetch(url)
out.parse()

# search first, then pass links to fetch
res = miner.search(filter = {'has_full_text': True, 'license_url': "http://creativecommons.org/licenses/by/4.0"})
# url = res.links_pdf()[0]
url = 'http://www.nepjol.info/index.php/JSAN/article/viewFile/13527/10928'
x = miner.fetch(url)
miner.extract(x.path)