miner module
miner.search(ids=None, member=None, filter=None, limit=500, **kwargs)
Search Crossref to get text mining links.
Parameters:
- ids – [Array] DOIs (digital object identifiers) or other identifiers
- member – [String] member ids
- filter – [Hash] Filter options. See ...
- limit – [Fixnum] Number of results to return. Not relevant when searching with specific DOIs. Default: 20. Max: 1000
- kwargs – any additional arguments will be passed on to requests.get
Returns: A dictionary of results
Usage:
    from pyminer import miner

    miner.search(filter = {'has_full_text': True}, limit = 5)
    miner.search(filter = {'full_text_type': 'text/plain', 'license_url': "http://creativecommons.org/licenses/by-nc-nd/3.0"})
    miner.search(filter = {'has_full_text': True, 'license_url': "http://creativecommons.org/licenses/by/4.0"})
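The filter keys in the usage above use underscores, while the Crossref REST API names the same filters `has-full-text`, `full-text.type` and `license.url`. How pyminer translates between the two is not described here, so the following is only a minimal sketch of one way such a mapping could work. The helper `build_filter_string` and the table `CROSSREF_FILTER_NAMES` are hypothetical names and are not part of pyminer.

```python
# Hypothetical helper, not part of pyminer's documented API.
# Maps the underscore-style keys from the usage examples to the
# filter names used by the Crossref REST API.
CROSSREF_FILTER_NAMES = {
    "has_full_text": "has-full-text",
    "full_text_type": "full-text.type",
    "license_url": "license.url",
}


def build_filter_string(filters):
    """Render a filter dict as a 'name:value,name:value' string."""
    parts = []
    for key, value in filters.items():
        # Assumption: unknown keys simply swap underscores for hyphens.
        name = CROSSREF_FILTER_NAMES.get(key, key.replace("_", "-"))
        if isinstance(value, bool):
            # Crossref expects lowercase booleans in filter values.
            value = "true" if value else "false"
        parts.append(f"{name}:{value}")
    return ",".join(parts)


print(build_filter_string({"has_full_text": True, "full_text_type": "text/plain"}))
```

A string of this form is what would go in the `filter` query parameter of a Crossref `/works` request.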
miner.fetch(url)
Get full text
Works easily for open access papers, but not for closed-access ones. For non-OA papers, use Crossref's Text and Data Mining (TDM) service, which requires authentication and a pre-authorized IP address. Go to https://apps.crossref.org/clickthrough/researchers to sign up for the TDM service and get your key. The only publishers taking part at this time are Elsevier and Wiley.
Parameters: url – [String] A URL for full text
Returns: [Mined] An object of class Mined, with methods for extracting the URL requested, the file path, and parsing the plain text, XML, or extracting text from the PDF. Parsing XML returns an object of class lxml.etree._Element, which you can work with using, for example, lxml.
Usage:
    from pyminer import miner

    # pdf
    url = "http://www.banglajol.info/index.php/AJMBR/article/viewFile/25509/17126"
    out = miner.fetch(url)
    out.url
    out.path
    out.type
    out.parse()

    # xml
    url = "https://peerj.com/articles/cs-23.xml"
    out = miner.fetch(url)
    out.url
    out.path
    out.type
    out.parse()

    ## or drop down to individual parsing methods
    from pyminer import parsers as p
    p.parse_xml(out.path)
    p.parse_xml_string(out.path)

    # search first, then pass links to fetch
    res = miner.search()
    miner.fetch(res['url'])
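Once an XML document has been parsed into an element tree, pulling out the text is a matter of walking the elements. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies; the `iter`, `itertext` and `findtext` methods it relies on are also available on lxml elements. The sample document, its tag names, and the `paragraphs` helper are made up for illustration and do not come from pyminer or from any real article.

```python
import xml.etree.ElementTree as ET

# Made-up sample; a real article would come from parsing a fetched file.
SAMPLE_XML = """<article>
  <front><article-title>An example title</article-title></front>
  <body><p>First paragraph.</p><p>Second <italic>styled</italic> paragraph.</p></body>
</article>"""


def paragraphs(root):
    """Return the text of every <p> element, with inline markup flattened."""
    return ["".join(p.itertext()).strip() for p in root.iter("p")]


root = ET.fromstring(SAMPLE_XML)
title = root.findtext(".//article-title")  # first matching element's text
print(title)
print(paragraphs(root))
```

Using `itertext` rather than the `text` attribute matters for paragraphs that contain inline elements, since `text` stops at the first child tag.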
miner.extract(path)
Extract full text from PDFs
Parameters: path – [String] Path to a PDF file downloaded via fetch, or obtained another way.
Returns: [str] a string of text
Usage:
    from pyminer import miner

    # a pdf
    url = "http://www.banglajol.info/index.php/AJMBR/article/viewFile/25509/17126"
    out = miner.fetch(url)
    out.parse()

    # search first, then pass links to fetch
    res = miner.search(filter = {'has_full_text': True, 'license_url': "http://creativecommons.org/licenses/by/4.0"})
    # url = res.links_pdf()[0]
    url = 'http://www.nepjol.info/index.php/JSAN/article/viewFile/13527/10928'
    x = miner.fetch(url)
    miner.extract(x.path)
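The "search first, then fetch" step depends on finding full-text URLs in the search result. The exact shape of the dictionary returned by miner.search is not spelled out here, so the sketch below assumes it mirrors a Crossref REST API `/works` response, where each item may carry a `link` list whose entries have `URL` and `content-type` fields. The helper `full_text_links` is a hypothetical name, and the sample data is invented for illustration.

```python
def full_text_links(response, content_type=None):
    """Collect full-text URLs, optionally restricted to one content type."""
    urls = []
    for item in response.get("message", {}).get("items", []):
        # Items without full-text links have no "link" key at all.
        for link in item.get("link", []):
            if content_type is None or link.get("content-type") == content_type:
                urls.append(link["URL"])
    return urls


# Invented sample in the assumed Crossref response shape.
sample = {"message": {"items": [
    {"DOI": "10.5555/example.1", "link": [
        {"URL": "https://example.org/a.pdf", "content-type": "application/pdf"},
        {"URL": "https://example.org/a.xml", "content-type": "application/xml"},
    ]},
    {"DOI": "10.5555/example.2"},
]}}

pdf_urls = full_text_links(sample, "application/pdf")
print(pdf_urls)
```

Each URL collected this way could then be handed to miner.fetch, and the resulting path to miner.extract for PDFs.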