
Remote disconnected and didn't download files #34

Closed
bbanzai88 opened this issue Dec 6, 2023 · 5 comments · Fixed by #35
Labels
invalid This doesn't seem right

Comments

@bbanzai88

bbanzai88 commented Dec 6, 2023

Hi,
Very cool project! It looks like I installed it correctly, and I ran this code in a Jupyter notebook:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()   # Takes ~30 min and should result in a ~35 MB file
biorxiv()   # Takes ~1 h and should result in a ~350 MB file
chemrxiv()  # Takes ~45 min and should result in a ~20 MB file

I get this response:

61032it [20:29, 49.63it/s]
106700it [1:45:02, 16.93it/s]

And then I get the error below. Any ideas on what I can do? Thank you!

Sincerely,

tom

RemoteDisconnected                        Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\requests\adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    783     e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
    786     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    787 )
    788 retries.sleep()

File ~\anaconda3\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File ~\anaconda3\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:71, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     70 while do_loop:
---> 71     json_response = requests.get(
     72         self.get_papers_url.format(
     73             begin_date=begin_date, end_date=end_date, cursor=cursor
     74         )
     75     ).json()
     76     do_loop = json_response["messages"][0]["status"] == "ok"

File ~\anaconda3\lib\site-packages\requests\api.py:73, in get(url, params, **kwargs)
     63 r"""Sends a GET request.
     64 
     65 :param url: URL for the new :class:`Request` object.
   (...)
     70 :rtype: requests.Response
     71 """
---> 73 return request("get", url, params=params, **kwargs)

File ~\anaconda3\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~\anaconda3\lib\site-packages\requests\sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~\anaconda3\lib\site-packages\requests\sessions.py:703, in Session.send(self, request, **kwargs)
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)

File ~\anaconda3\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    500 except (ProtocolError, OSError) as err:
--> 501     raise ConnectionError(err, request=request)
    503 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
      2 medrxiv()  #  Takes ~30min and should result in ~35 MB file
----> 3 biorxiv()  # Takes ~1h and should result in ~350 MB file
      4 chemrxiv()

File ~\anaconda3\lib\site-packages\paperscraper\get_dumps\biorxiv.py:42, in biorxiv(begin_date, end_date, save_path)
     40 # dump all papers
     41 with open(save_path, "w") as fp:
---> 42     for index, paper in enumerate(
     43         tqdm(api.get_papers(begin_date=begin_date, end_date=end_date))
     44     ):
     45         if index > 0:
     46             fp.write(os.linesep)

File ~\anaconda3\lib\site-packages\tqdm\std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:85, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     83                 yield processed_paper
     84 except Exception as exc:
---> 85     raise RuntimeError(
     86         "Failed getting papers: {} - {}".format(exc.__class__.__name__, exc)
     87     )

RuntimeError: Failed getting papers: ConnectionError - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
@jannisborn added the `invalid` ("This doesn't seem right") label on Dec 6, 2023
@jannisborn
Owner

Hi @bbanzai88, thanks for the interest in the library. I can reproduce this ConnectionError locally when calling medrxiv().

This is a problem with the medRxiv API that is hard to mitigate directly in paperscraper. An intermediate solution could be to handle the try/except more gracefully here and add a time.sleep if the API does not respond.

The best long-term solution is to solve #33, which would mean users no longer have to download the data to local disk and could query the dumps directly from a server where I host them. That would require more time, though.
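To illustrate the intermediate workaround: a retry wrapper that sleeps between attempts could look roughly like this. This is only a sketch, not paperscraper's actual code; `retry_with_sleep` is a hypothetical helper name, and in real use you would catch `requests.exceptions.ConnectionError` rather than the builtin one.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_sleep(call: Callable[[], T], attempts: int = 5, wait: float = 1.0) -> T:
    """Retry a flaky call, pausing between attempts (hypothetical helper)."""
    last_exc = None
    for i in range(attempts):
        try:
            return call()
        except ConnectionError as exc:  # the error seen in the traceback above
            last_exc = exc
            time.sleep(wait * (i + 1))  # back off a little longer each time
    raise RuntimeError(f"API did not respond after {attempts} attempts") from last_exc
```

The `requests.get(...)` in `XRXivApi.get_papers` could then be wrapped as, e.g., `retry_with_sleep(lambda: requests.get(url).json())`, so a single dropped connection no longer aborts a multi-hour dump.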

@memray
Contributor

memray commented May 21, 2024

Hi @jannisborn, I'm facing the same ConnectionError issue after scraping a few hundred records. Could you share the dump somewhere so we don't have to scrape it repeatedly?

Thanks,

@jannisborn
Owner

Hi @memray,

Which version of paperscraper are you using?

@jannisborn
Owner

There is no update yet on #33, but if you have the time to prepare a PR, that would be great and I'm happy to assist.

@memray
Contributor

memray commented May 22, 2024

Hi @jannisborn ,

I created a PR to resolve the timeout issue; please review it. It's working pretty well for now (apart from lots of ConnectionError printouts).
