Language spam, as the name suggests, pollutes the language dimension, which skews the related views, metrics, and reports in Google Analytics (GA). It is similar to the more common GA referrer spam and has been widespread since late October 2016.
GA records whatever language value the browser reports, which is normally a short ISO 639-based code such as en-us. Even though the browser APIs expose the locale or user-language setting as read-only, the value can be tampered with before ga.js sends its beacon to Google's servers.
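To see why the language value cannot be trusted, it helps to look at what a hit looks like on the wire. The sketch below assumes the Universal Analytics Measurement Protocol, where the user language travels as the `ul` query parameter; the ga.js beacon mentioned above uses a different request format, so treat this as an illustration only. The tracking ID, client ID, and page path are placeholders, and `build_pageview_payload` is a hypothetical helper. The code only builds the payload string and sends nothing.

```python
from urllib.parse import urlencode, parse_qs


def build_pageview_payload(language: str) -> str:
    """Build a pageview payload string; the language is just another field."""
    params = {
        "v": "1",              # protocol version
        "tid": "UA-XXXXX-Y",   # placeholder tracking ID
        "cid": "555",          # placeholder client ID
        "t": "pageview",       # hit type
        "dp": "/home",         # placeholder page path
        "ul": language,        # user language: free text as far as the sender is concerned
    }
    return urlencode(params)


# Any string can be placed in the language field, not just a real language code.
payload = build_pageview_payload("not a real language code")
print(payload)
print(parse_qs(payload)["ul"])
```

Because the field is plain text supplied by the sender, the cleanup has to happen on the reporting side, which is what the two steps below do.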
Once GA has recorded the traffic, the data cannot be deleted or edited, so cleaning up language spam takes two steps:
- Block future language spam traffic.
- Filter out historic spam data.
1. Block future language spam traffic
To prevent language spam in the future, create a view filter whose regular expression excludes traffic with more than 8 characters in the language field, which is longer than the language codes that legitimate browsers normally report.
Click "Verify this filter" to dry-run it against the past week's traffic data.
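The filter pattern itself is not reproduced in the text here, so the following is a minimal stand-in rather than the original expression. It assumes the rule described above (more than 8 characters means spam) and uses the pattern `.{9,}`. The names `LANGUAGE_SPAM_PATTERN` and `is_language_spam` and the sample values are made up for illustration. Trying the pattern locally against a few values is a quick sanity check before pasting it into a GA filter.

```python
import re

# Assumed stand-in pattern: any language value with 9 or more characters.
LANGUAGE_SPAM_PATTERN = re.compile(r".{9,}")


def is_language_spam(language: str) -> bool:
    """Return True when the language value is longer than 8 characters."""
    # search() mirrors a partial-match filter: the pattern may match anywhere.
    return LANGUAGE_SPAM_PATTERN.search(language) is not None


# Invented sample values: three ordinary codes and one spam-like message.
samples = ["en-us", "fr", "pt-br", "visit example.com for free traffic"]
for value in samples:
    label = "exclude" if is_language_spam(value) else "keep"
    print(f"{label}: {value}")
```

A pure length rule is blunt: any unusually long but legitimate locale value would also be excluded, so it is worth checking the verification preview before saving the filter.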
2. Filter out historic spam data
For damage control, create a segment that uses the same regular expression to exclude all language spam traffic.
The drawback is that this segment must be applied every time a report is pulled for a date range that overlaps the period in which the language spam was recorded.