Hi, I’m trying a text summarization exercise, and I have train and test datasets with two columns: text and summary (the labels). I’m using T5, PyTorch, and the Lightning wrapper. I have a PyTorch Dataset class that returns the dictionary below: the raw text and summary, plus the input ids, labels, and attention masks as tensors.
Link to the Colab notebook
Link to the NEWS dataset
return dict(
    text=text,
    summary=data_row['summary'],
    text_input_ids=text_encoding['input_ids'].flatten(),
    text_attention_mask=text_encoding['attention_mask'].flatten(),
    labels=labels.flatten(),
    labels_attention_mask=summary_encoding['attention_mask'].flatten()
)
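For context on the `.flatten()` calls, here is a minimal sketch of what they do to the batch shapes. It assumes the tokenizer is called with `return_tensors='pt'`, so each encoding has shape `(1, max_len)`. `ToyEncodedDataset` and its fake encodings are hypothetical stand-ins, not the actual dataset class from the notebook.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyEncodedDataset(Dataset):
    """Hypothetical stand-in that mimics tokenizer output shapes."""

    def __init__(self, n_rows: int = 4, max_len: int = 6):
        self.n_rows = n_rows
        self.max_len = max_len

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        # Assumption: a tokenizer called with return_tensors='pt' yields
        # tensors of shape (1, max_len); these are faked here.
        input_ids = torch.full((1, self.max_len), idx, dtype=torch.long)
        attention_mask = torch.ones((1, self.max_len), dtype=torch.long)
        # flatten() drops the leading dimension, giving shape (max_len,)
        return dict(
            text_input_ids=input_ids.flatten(),
            text_attention_mask=attention_mask.flatten(),
        )


loader = DataLoader(ToyEncodedDataset(), batch_size=2, shuffle=False)
batch = next(iter(loader))
# The default collate function stacks the per-item tensors along a new
# batch dimension, so each value becomes (batch_size, max_len).
print(batch["text_input_ids"].shape)
```

Without the `flatten()` calls, the same loader would produce tensors of shape `(batch_size, 1, max_len)`.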
Then I have a Lightning data module class that converts the dataframes into PyTorch datasets and wraps them in data loaders, returning the train, val, and test data loaders:
class TextSummaryDataModule(pl.LightningModule):
    def __init__(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        tokenizer: T5Tokenizer,
        batch_size: int = 8,
        text_max_token_len: int = 512,
        summary_max_token_len: int = 128
    ):
        super().__init__()
        self.train_df = train_df
        self.test_df = test_df
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.text_max_token_len = text_max_token_len
        self.summary_max_token_len = summary_max_token_len

    def setup(self):
        self.train_dataset = TextSummaryDataset(
            self.train_df,
            self.tokenizer,
            self.text_max_token_len,
            self.summary_max_token_len
        )
        self.test_dataset = TextSummaryDataset(
            self.test_df,
            self.tokenizer,
            self.text_max_token_len,
            self.summary_max_token_len
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=2
        )

    def val_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=2
        )

    def test_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=2
        )
Everything works until I try to run the model, at which point I get the following warning and error:
- UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop. But I have defined val_dataloader and it returns a data loader in the data module?
- Invalid Datatype for loaders: TextSummaryDataModule. But I am returning a dictionary of the tokens, attention_mask, and labels for both text and summary?