Feature/get assignments from s3 #190

Open
wants to merge 45 commits into
base: main
45 commits
1094d0b
first proof-of-concept for skimming
soldni Mar 26, 2022
3fa2618
new ui, async calls
soldni Mar 28, 2022
d22e20b
mmda for counting tokens + optional UI
soldni Mar 29, 2022
94711e6
using decorators for dpi
soldni Mar 29, 2022
f09332e
increased timeout + small UI improvement
soldni Apr 9, 2022
6bda481
added support for mmda
soldni Apr 11, 2022
35b4ec6
fixed small bugs
soldni Apr 11, 2022
e405e75
updated save location
soldni Apr 11, 2022
c7ec0a5
changed paths to avoid build failure in API on GCP
soldni Apr 11, 2022
a7f7f35
pawls cli functionality is now duplicated in api
soldni Apr 11, 2022
db0ea83
increased build timeout
soldni Apr 11, 2022
7430581
small changes to package list
soldni Apr 11, 2022
1e99ed6
even more time for building
soldni Apr 11, 2022
16df31d
Update skiff.json
codeviking Apr 11, 2022
88d5f52
Remove liveness probes. (#170)
codeviking Apr 11, 2022
6687760
removed `share_memory` for model.
soldni Apr 11, 2022
c372f72
removed mmda dep
soldni Apr 12, 2022
952304b
small tweaks
soldni Apr 12, 2022
0cdc1d0
Add support for editing the number of replicas. (#171)
codeviking Apr 20, 2022
1b2fbe6
improvement on ui
soldni Apr 26, 2022
4adf230
added user info
soldni Apr 26, 2022
60b7668
displaying user info
soldni Apr 26, 2022
59d7cc9
a bit more logging
soldni Apr 26, 2022
8a91b69
added support for S3 backend
soldni Apr 28, 2022
2b83582
username, finished/junk switches fix
soldni Apr 28, 2022
f99e706
increased maximum size of upload
soldni Apr 28, 2022
d14a3d7
full label reveals on hover
soldni Apr 28, 2022
0bd000f
better data handling, messaging if paper missing
soldni Apr 28, 2022
1444889
writing with indent
soldni Apr 29, 2022
12789a4
added documentation
soldni Apr 29, 2022
ee4eed2
changed width
soldni Apr 29, 2022
206647e
added debug page
soldni May 2, 2022
60dcafe
improved process
soldni May 3, 2022
3425c3a
new links
soldni Jun 2, 2022
4bb9d3f
Update skimming-annotations.md
soldni Jul 15, 2022
fe541bf
updated replicas
soldni Sep 28, 2022
89eac59
Update skiff.json
soldni Sep 28, 2022
7a55852
reduced resource usage
soldni Sep 28, 2022
42c29c5
Updating output_directory
egork520 Nov 1, 2022
b80c3d5
updated deps
soldni Nov 1, 2022
87a640e
Fetching main
egork520 Nov 1, 2022
000cbd6
Adding import of AWS access environment variables
egork520 Nov 2, 2022
d18dd49
Keeping only labels we need for the project
egork520 Nov 2, 2022
d7523fe
Revert "Keeping only labels we need for the project"
egork520 Nov 4, 2022
783a434
Updating output directory and README.md
egork520 Nov 4, 2022
11 changes: 9 additions & 2 deletions .gitignore
@@ -18,7 +18,7 @@ ui/*.log
# Ignore any files in ./skiff_files
skiff_files/*

### Python Build Related ###
### Python Build Related ###

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -157,4 +157,11 @@ dmypy.json
.pytype/

# Cython debug symbols
cython_debug/
cython_debug/

# macOS
.DS_Store
node_modules/
.vscode
/tmp
ui/package-lock.json
2 changes: 1 addition & 1 deletion .skiff/cloudbuild-deploy.yaml
@@ -137,4 +137,4 @@ artifacts:
location: 'gs://skiff-archive/$REPO_NAME/$_ENV/$BUILD_ID/$COMMIT_SHA'
paths: ['.skiff/webapp.yaml']

timeout: 900s
timeout: 3600s
56 changes: 27 additions & 29 deletions .skiff/webapp.jsonnet
@@ -24,16 +24,6 @@ function(
config.appName + '.' + env + topLevelDomain
];

// In production we run two versions of your application, as to ensure that
// if one instance goes down or is busy, end users can still use the application.
// In all other environments we run a single instance to save money.
local replicas = (
if env == 'prod' then
2
else
1
);

// Each app gets its own namespace.
local namespaceName = config.appName;

@@ -122,7 +112,9 @@ function(
'nginx.ingress.kubernetes.io/ssl-redirect': 'true',
'nginx.ingress.kubernetes.io/auth-url': 'https://google.login.apps.allenai.org/oauth2/auth',
'nginx.ingress.kubernetes.io/auth-signin': 'https://google.login.apps.allenai.org/oauth2/start?rd=https://$host$request_uri',
'nginx.ingress.kubernetes.io/auth-response-headers': 'X-Auth-Request-User, X-Auth-Request-Email'
'nginx.ingress.kubernetes.io/auth-response-headers': 'X-Auth-Request-User, X-Auth-Request-Email',
'nginx.ingress.kubernetes.io/proxy-read-timeout': '300',
'nginx.ingress.kubernetes.io/proxy-body-size': '50m'
}
},
spec: {
@@ -163,7 +155,7 @@ function(
},
spec: {
revisionHistoryLimit: 3,
replicas: replicas,
replicas: config.replicas,
selector: {
matchLabels: selectorLabels
},
@@ -215,7 +207,27 @@ function(
{
name: fullyQualifiedName + '-api',
image: apiImage,
env: [ { name: "IN_PRODUCTION", value: "prod" }],
env: [
{ name: "IN_PRODUCTION", value: "prod" },
{
name: "AWS_ACCESS_KEY_ID",
valueFrom: {
secretKeyRef: {
name: "aws-pdf-iam",
key: "AWS_ACCESS_KEY_ID"
}
}
},
{
name: "AWS_SECRET_ACCESS_KEY",
valueFrom: {
secretKeyRef: {
name: "aws-pdf-iam",
key: "AWS_SECRET_ACCESS_KEY"
}
}
},
],
volumeMounts: [
{
mountPath: '/skiff_files/apps/pawls',
@@ -265,14 +277,6 @@ function(
periodSeconds: 10,
failureThreshold: 3
},
livenessProbe: {
httpGet: apiHealthCheck + {
path: '/?check=liveness_probe'
},
periodSeconds: 10,
failureThreshold: 9,
initialDelaySeconds: 30
},
# This tells Kubernetes what CPU and memory resources your API needs.
# We set these values low by default, as most applications receive
# bursts of activity and accordingly don't need dedicated resources
@@ -310,16 +314,10 @@ function(
path: '/?check=rdy'
}
},
livenessProbe: {
failureThreshold: 6,
httpGet: proxyHealthCheck + {
path: '/?check=live'
}
},
resources: {
requests: {
cpu: '50m',
memory: '100Mi'
cpu: '500m',
memory: '500Mi'
}
}
}
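With this change the API container receives `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the `aws-pdf-iam` secret. A minimal sketch of a startup check the API could run so that a missing secret fails early rather than on the first S3 call; `load_aws_credentials` is a hypothetical helper and not code from this PR.

```python
import os
from typing import Dict, Mapping

# The two variable names come from the env list in webapp.jsonnet.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_credentials(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the AWS variables from `environ`, raising if any is missing or empty.

    Hypothetical helper: the PR does not show how the API reads these values.
    """
    missing = [name for name in REQUIRED_AWS_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing AWS environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_AWS_VARS}
```

Passing the environment as an argument keeps the check easy to exercise in tests without touching the real process environment.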
45 changes: 44 additions & 1 deletion README.md
@@ -53,7 +53,11 @@ For instance, you can run the following commands to download, preprocess, and as
pawls assign skiff_files/apps/pawls/papers [email protected] --all --name-file skiff_files/apps/pawls/papers/name_mapping.json
```

and then open up the UI locally by running `docker-compose up`.
#### Getting annotation files to S3
PDFs and assignment files from the status folder need to be copied to the S3 bucket set as `output_directory` in `api/config/configuration.json`.
Annotations will be uploaded to the same `output_directory`.
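
A minimal sketch of planning that copy, assuming each local file keeps its relative path as the S3 key under some prefix. The helper name `plan_uploads`, the prefix, and the key layout are hypothetical; the layout the API actually expects is whatever `output_directory` points to. The upload itself (for example with the AWS CLI or `boto3`) is left out so the sketch runs without credentials.

```python
from pathlib import Path, PurePosixPath
from typing import Dict


def plan_uploads(local_root: str, s3_prefix: str) -> Dict[str, str]:
    """Map each PDF and JSON file under `local_root` to a hypothetical S3 key.

    Keys are built as `<s3_prefix>/<path relative to local_root>`.
    """
    root = Path(local_root)
    plan = {}
    for path in sorted(root.rglob("*")):
        # Only PDFs and JSON files (assignment/status files) are planned.
        if path.is_file() and path.suffix.lower() in {".pdf", ".json"}:
            relative = path.relative_to(root).as_posix()
            plan[str(path)] = str(PurePosixPath(s3_prefix) / relative)
    return plan
```

Reviewing the returned mapping before uploading makes it easy to spot files that would land under the wrong key.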

And then open up the UI locally by running `docker-compose up`.

### Authentication and Authorization

@@ -203,3 +207,42 @@ If you find PAWLS helpful for your research, please consider cite PAWLS.
---

PAWLS is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

## Replica Management

Because the application is used in short bursts for annotation projects, we manually turn
the application on and off. We do this by managing the number of replicas, toggling it from
`0` to `1` and vice versa.

To adjust the number of replicas, edit the `skiff.json` and change the replica
count. For instance, you can turn the application "off" like so:

```diff
{
"appName": "pawls",
"contact": "lucas",
"team": "s2research",
- "replicas": 1
+ "replicas": 0
}
```

...and turn it back "on" by reversing that change:

```diff
{
"appName": "pawls",
"contact": "lucas",
"team": "s2research",
- "replicas": 0
+ "replicas": 1
}
```

The change is applied after you commit and push it. It usually
takes about 5 minutes to take effect.

You can confirm the change by visiting [Marina](https://marina.apps.allenai.org/a/pawls)
and inspecting the "Replicas" list for the `skimming-annotations` environment.
The number of replicas displayed there should match the value in `skiff.json`.
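
If you toggle replicas often, the edit itself can be scripted. A minimal sketch, assuming `skiff.json` has the flat shape shown in the diffs above; `set_replicas` is a hypothetical helper, not part of PAWLS or Skiff, and the change still has to be committed and pushed afterwards.

```python
import json
from pathlib import Path


def set_replicas(skiff_json_path: str, replicas: int) -> dict:
    """Rewrite the `replicas` field of a skiff.json-style file.

    Returns the updated config. Hypothetical helper for illustration only.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    path = Path(skiff_json_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["replicas"] = replicas
    # Write back with indentation so the resulting git diff stays readable.
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `set_replicas("skiff.json", 0)` would turn the application "off" and `set_replicas("skiff.json", 1)` would turn it back "on".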

6 changes: 4 additions & 2 deletions api/Dockerfile
@@ -1,4 +1,5 @@
FROM python:3.7.2
FROM python:3.8.12


# Setup a spot for the api code
WORKDIR /usr/local/src/skiff/app/api
@@ -11,6 +12,7 @@ COPY requirements.txt .

RUN pip install -r requirements.txt

########## COPYING SOURCE CODE FROM HERE ON ##########

# Copy over the source code
COPY app app/
@@ -19,4 +21,4 @@ COPY main.py main.py

# Kick things off
ENTRYPOINT [ "uvicorn" ]
CMD ["main:app", "--host", "0.0.0.0"]
CMD ["main:app", "--host", "0.0.0.0"]
21 changes: 20 additions & 1 deletion api/app/annotations.py
@@ -1,5 +1,5 @@
from typing import Optional, List
from pydantic import BaseModel
from pydantic import BaseModel, Field


class Bounds(BaseModel):
@@ -36,3 +36,22 @@ class RelationGroup(BaseModel):
class PdfAnnotation(BaseModel):
    annotations: List[Annotation]
    relations: List[RelationGroup]


class PageSpec(BaseModel):
    width: int
    height: int
    index: int


class PageToken(BaseModel):
    text: str
    width: float
    height: float
    x: float
    y: float


class Page(BaseModel):
    page: PageSpec
    tokens: List[PageToken] = Field(default_factory=list)
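
The `Page`, `PageSpec`, and `PageToken` models describe one page entry of the token structure. A dependency-free sketch of checking a plain dict against those field names; `check_page` is a hypothetical helper that only compares keys and is not a substitute for pydantic validation (it does not check types).

```python
from typing import Any, Dict

# Field names copied from the PageSpec and PageToken models above.
PAGE_SPEC_FIELDS = {"width", "height", "index"}
PAGE_TOKEN_FIELDS = {"text", "width", "height", "x", "y"}


def check_page(page: Dict[str, Any]) -> bool:
    """Return True if `page` uses the key names of Page, PageSpec, and PageToken.

    `tokens` may be omitted, matching the model's empty-list default.
    """
    if "page" not in page or set(page) - {"page", "tokens"}:
        return False
    if set(page["page"]) != PAGE_SPEC_FIELDS:
        return False
    return all(set(token) == PAGE_TOKEN_FIELDS for token in page.get("tokens", []))


example_page = {
    "page": {"width": 612, "height": 792, "index": 0},
    "tokens": [{"text": "PAWLS", "x": 72.0, "y": 90.0, "width": 40.0, "height": 12.0}],
}
```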
126 changes: 126 additions & 0 deletions api/app/pdfplumber.py
@@ -0,0 +1,126 @@
import json
import logging
from pathlib import Path
from typing import List, Union

import pandas as pd
import pdfplumber

from .annotations import PageToken, Page


logger = logging.getLogger("uvicorn")


class PDFPlumberTokenExtractor:

    @staticmethod
    def convert_to_pagetoken(row: pd.Series) -> dict:
        """Convert a DataFrame row to a dict with the fields of `PageToken`."""
        return dict(
            text=row["text"],
            x=row["x0"],
            width=row["width"],
            y=row["top"],
            height=row["height"],
        )

    def extract(self, pdf_path: str) -> List[dict]:
        """Extracts token text and positions from a PDF file.

        Args:
            pdf_path (str): the path to the pdf file.

        Returns:
            List[dict]: one dict per page, shaped like `Page`, holding the
                page size, the page index, and the word tokens.
        """
        pages = []
        # Use a context manager so the PDF file handle is closed after parsing.
        with pdfplumber.open(pdf_path) as plumber_pdf_object:
            for page_id, cur_page in enumerate(plumber_pdf_object.pages):
                tokens = self.obtain_word_tokens(cur_page)

                page = dict(
                    page=dict(
                        width=float(cur_page.width),
                        height=float(cur_page.height),
                        index=page_id
                    ),
                    tokens=tokens
                )
                pages.append(page)

        return pages

    def obtain_word_tokens(self, cur_page: pdfplumber.page.Page) -> List[dict]:
        """Obtain all words from the current page.

        Args:
            cur_page (pdfplumber.page.Page):
                the pdfplumber.page.Page object with PDF token information

        Returns:
            List[dict]:
                A list of word tokens, each a dict with the fields of `PageToken`.
        """
        words = cur_page.extract_words(
            x_tolerance=1.5,
            y_tolerance=3,
            keep_blank_chars=False,
            use_text_flow=True,
            horizontal_ltr=True,
            vertical_ttb=True,
            extra_attrs=["fontname", "size"],
        )
        if len(words) == 0:
            return []

        df = pd.DataFrame(words)

        # Avoid boxes outside the page
        df[["x0", "x1"]] = df[["x0", "x1"]].\
            clip(lower=0, upper=int(cur_page.width)).\
            astype("float")

        df[["top", "bottom"]] = df[["top", "bottom"]].\
            clip(lower=0, upper=int(cur_page.height)).\
            astype("float")

        df["height"] = df["bottom"] - df["top"]
        df["width"] = df["x1"] - df["x0"]

        word_tokens = df.apply(self.convert_to_pagetoken, axis=1).tolist()
        return word_tokens



def process_pdfplumber(file_path: Union[str, Path]) -> Path:
    """
    Run the pdfplumber pre-processor on a single PDF and write the
    resulting token information to `pdf_structure.json` next to the PDF.
    If that file already exists, the PDF is not parsed again.
    """
    file_path = Path(file_path)

    if not file_path.exists():
        msg = f'Cannot find {file_path}'
        raise ValueError(msg)

    structure_path = file_path.parent / "pdf_structure.json"

    if not structure_path.exists():

        # Use the module-level "uvicorn" logger rather than the root logger.
        logger.info(f"Processing {file_path} using pdfplumber...")

        pdf_extractors = PDFPlumberTokenExtractor()
        data = pdf_extractors.extract(str(file_path))

        with open(structure_path, mode="w+", encoding='utf-8') as f:
            json.dump(data, f)
    else:
        # logging.warn is deprecated; warning() is the supported name.
        logger.warning(f"Parsed {structure_path} exists, skipping...")

    return structure_path
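
The clipping and size computation in `obtain_word_tokens` can be illustrated without pandas or pdfplumber. The sketch below assumes word dicts carrying the `text`, `x0`, `x1`, `top`, and `bottom` keys that the code above reads; the helper names `clip` and `to_page_token` and the sample values are hypothetical.

```python
from typing import Dict, Union

Number = Union[int, float]


def clip(value: Number, lower: Number, upper: Number) -> float:
    """Clamp `value` into the closed range [lower, upper]."""
    return float(min(max(value, lower), upper))


def to_page_token(word: Dict, page_width: Number, page_height: Number) -> Dict:
    """Clip a word box to the page and convert it to x/y/width/height form."""
    # As in the code above, the upper bounds are truncated with int().
    x0 = clip(word["x0"], 0, int(page_width))
    x1 = clip(word["x1"], 0, int(page_width))
    top = clip(word["top"], 0, int(page_height))
    bottom = clip(word["bottom"], 0, int(page_height))
    return {
        "text": word["text"],
        "x": x0,
        "y": top,
        "width": x1 - x0,
        "height": bottom - top,
    }


# A box that starts left of the page and ends below it gets pulled inside.
word = {"text": "PAWLS", "x0": -2.0, "x1": 40.0, "top": 780.0, "bottom": 800.0}
token = to_page_token(word, page_width=612, page_height=792)
print(token)
```

Clipping before computing `width` and `height` means the reported size reflects only the part of the box that lies on the page.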