Data leakage occurs when information from outside the training dataset improperly influences the model, producing overly optimistic performance estimates. It typically happens when the training data contains information that would not be available at prediction time in a real-world scenario, such as future values or features derived from the target. Because the model has effectively seen part of its test conditions during training, it performs poorly on genuinely new, unseen data, undermining its ability to generalize.
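One common way leakage sneaks in is fitting a preprocessing step on the full dataset before splitting it, so the test points influence the transformation applied to the training data. The following is a minimal sketch of that pitfall using a hypothetical toy dataset (the specific numbers and split are illustrative assumptions, not from the original text):

```python
import random
import statistics

# Hypothetical toy dataset; the principle matters, not the numbers.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
train, test = data[:80], data[80:]

# LEAKY: normalization statistics computed on the FULL dataset,
# so information from the test split leaks into training.
full_mean = statistics.mean(data)
full_std = statistics.stdev(data)
leaky_train = [(x - full_mean) / full_std for x in train]

# CORRECT: statistics computed on the training split only,
# then reused unchanged to transform the test split.
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)
clean_train = [(x - train_mean) / train_std for x in train]
clean_test = [(x - train_mean) / train_std for x in test]

# The two versions of the training data differ, showing that the
# test split silently changed what the model would be trained on.
print(leaky_train[0] != clean_train[0])
```

The same rule applies to any fitted transformation (scaling, imputation, feature selection): fit on the training split only, then apply the fitted parameters to the test split.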