It has come a long way in the last couple of years. For example:
Vid2Vid:
https://arxiv.org/abs/1808.06601
Live Face De-Identification:
https://research.fb.com/wp-content/uploads/2019/10/Live-Face-De-Identification-in-Video.pdf
I actually co-authored a paper this year (very similar in spirit to Live Face De-Identification) that got into CVPR (a top vision venue, and the venue with the highest impact in all of CS). It is more focused on images, but we were able to get temporal consistency for free, simply by smoothing the trajectories of each frame.
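To give a rough idea of what trajectory smoothing can look like, here is a minimal sketch. It assumes each frame is summarized by a vector (for example landmark coordinates or a latent code) and applies a plain centered moving average over time. The function name `smooth_trajectory`, the `window` parameter, and the choice of filter are illustrative assumptions, not the method from the paper.

```python
from typing import List, Sequence


def smooth_trajectory(
    frames: Sequence[Sequence[float]], window: int = 5
) -> List[List[float]]:
    """Centered moving average over per-frame vectors.

    Hypothetical helper: `frames[t]` is whatever vector describes frame t.
    The window shrinks near the start and end of the sequence.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    n = len(frames)
    smoothed: List[List[float]] = []
    for t in range(n):
        # Neighbouring frames inside the window, clipped to the sequence.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        neighbours = frames[lo:hi]
        dim = len(frames[t])
        # Average each coordinate independently across the neighbours.
        smoothed.append(
            [sum(f[d] for f in neighbours) / len(neighbours) for d in range(dim)]
        )
    return smoothed
```

A larger `window` gives steadier output but reacts more slowly to real motion, so the value would need tuning per use case.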
I agree that video generation still has a long way to go, and at the moment Photoshop is better at generating DeepFakes than AI approaches, but there has been tremendous progress recently.