Author Topic: Prevent duplicate ad posts  (Read 15690 times)

dev101

  • Osclass Hero
  • Hero Member
  • *
  • Posts: 2155
  • osclass.work
Re: Prevent duplicate ad posts
« Reply #165 on: March 12, 2017, 02:26:14 pm »
Say you set a limit to scan/compare against 1000 other items every hour. A new item is posted, the initial scanning batch begins, you compare it with the first* 1000 items, and no duplicates are found. But what if the duplicates are among the rest of your items (say, 20k in total site-wide)? This is what I meant: if you split the scanning into batches, you can't be sure whether an item is a dupe until every batch has run.

* This is actually why you may not need a full scan in the end, only a small sample of, say, the latest 1000 items: a duplicate is much more likely to be found there than among items from past weeks or months.
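
A minimal sketch of the "latest N items only" idea, assuming the items are already loaded newest first as plain arrays. The function name, the array shape and the 90% threshold are hypothetical and not part of any Osclass API; only PHP's built-in similar_text() is used for the comparison.

```php
<?php
// Hypothetical helper, not an Osclass function.
// $items: list of ['id' => int, 'title' => string], ordered newest first.
function find_recent_duplicates(string $newTitle, array $items, int $limit = 1000, float $threshold = 90.0): array
{
    $matches = [];
    $needle = strtolower($newTitle);

    // Only the newest $limit items are compared; older ones are skipped on purpose.
    foreach (array_slice($items, 0, $limit) as $item) {
        similar_text($needle, strtolower($item['title']), $percent);
        if ($percent >= $threshold) {
            $matches[] = $item['id'];
        }
    }

    return $matches; // IDs of recent items that look like duplicates
}
```

The cost per new item is bounded by $limit, which is the whole point: the comparison never grows with the size of the site.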

Aficionado

  • Guest
Re: Prevent duplicate ad posts
« Reply #166 on: March 12, 2017, 02:29:38 pm »
Quote from: dev101
Say you set a limit to scan/compare against 1000 other items every hour. A new item is posted, the initial scanning batch begins, you compare it with the first* 1000 items, and no duplicates are found. But what if the duplicates are among the rest of your items (say, 20k in total site-wide)? This is what I meant: if you split the scanning into batches, you can't be sure whether an item is a dupe until every batch has run.

OK, then the new ads will be scanned over the next few cron runs. What is wrong with that, apart from some delay? Or could it be that some ads will never be scanned?


dev101

Re: Prevent duplicate ad posts
« Reply #167 on: March 12, 2017, 02:31:53 pm »
Well, how exactly do you imagine cron chunking? What is your idea about it?

* * *

An addition to the previous thoughts/ideas: when you find duplicates (say, more than 5, which is an alarming rate), the user account should be temporarily suspended to stop it from posting.

Also, the scanning algorithm should compare against items regardless of whether they are active, spam or blocked, and regardless of which account they come from. Blocked items contain precious information about potential spam. If you only compare new items with fresh content from the same account and, for example, a spammer registers one last time and posts again, that last item might be missed, because it is only compared with that account's own items (and there will be only 1 in total).
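
The two rules above could look roughly like this. It is only a sketch under assumptions: the function names and the item array shape are made up for illustration, exact matching on a normalized title stands in for whatever comparison the plugin really uses, and the actual suspension call is left out.

```php
<?php
// Hypothetical helpers, not part of Osclass.
function normalize_title(string $title): string
{
    // Lowercase and collapse whitespace so trivial edits do not hide a duplicate.
    return strtolower(trim(preg_replace('/\s+/', ' ', $title)));
}

// $allItems: list of ['user_id' => int, 'status' => string, 'title' => string].
function count_duplicates(string $newTitle, array $allItems): int
{
    $hash = md5(normalize_title($newTitle));
    $count = 0;
    foreach ($allItems as $item) {
        // Deliberately no filter on 'status' or 'user_id':
        // blocked and spam items from other accounts are compared too.
        if (md5(normalize_title($item['title'])) === $hash) {
            $count++;
        }
    }
    return $count;
}

function should_suspend_user(string $newTitle, array $allItems, int $maxDuplicates = 5): bool
{
    // "More than 5" duplicates is the alarming rate mentioned above.
    return count_duplicates($newTitle, $allItems) > $maxDuplicates;
}
```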

Aficionado

Re: Prevent duplicate ad posts
« Reply #168 on: March 12, 2017, 02:38:08 pm »
Quote from: dev101
Well, how exactly do you imagine cron chunking? What is your idea about it?

As I said, Osclass already uses chunked cron runs to count (?) regions/states or something. 5000 per cron run, if I remember correctly (I actually asked for chunks and Daniel was kind enough to do it a few years back, because I was running out of memory back then).

So could we inject something in there?

If not, then in my case an hourly cron check of the latest ads against the db could do the job. I don't have that many ads posted daily, maybe 100 or 150.

For checking the complete set of ads against each other, I will have to think about it.

dev101

Re: Prevent duplicate ad posts
« Reply #169 on: March 12, 2017, 02:49:59 pm »
Yeah, but those are FIXED numbers. You don't really change your regions and cities on a daily basis, do you?

This is why that algorithm is much simpler: you scan the first 1000 cities, then the next 1000, and so on, simply storing the batch index (the offset multiplier) in the db.

Code:
$limit = max(1000, ceil($total_cities/22));
Items, on the other hand, are live. They constantly change: new ones are added, old ones get deleted or expire, indexes pile up, etc. Also, you need to store the scanning offset for every item separately. Say you publish a new item, compare it with the latest N items, and store the position where you stopped. By the next run, newer items have already been published, so you can continue in either direction.

Again, if we limit the scan to the latest items only, things become much simpler.
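
For the fixed-size case (cities), the stored batch index could be used along these lines. This is a sketch only: the function names are hypothetical, the limit reuses the formula quoted above, and reading or writing the index in the db is left to the caller.

```php
<?php
// Hypothetical helpers illustrating chunked scanning with a stored batch index.
function batch_limit(int $total): int
{
    // Same formula as the snippet above: at least 1000 rows per run.
    return (int) max(1000, ceil($total / 22));
}

function next_batch(int $total, int $limit, int $index): array
{
    $offset = $index * $limit;
    if ($offset >= $total) {
        // Stored index points past the end (rows were deleted): start over.
        $index = 0;
        $offset = 0;
    }

    $next = $index + 1;
    if ($next * $limit >= $total) {
        $next = 0; // last chunk done, wrap around on the following run
    }

    // 'offset' and 'limit' would feed a LIMIT/OFFSET query; 'next_index' gets stored.
    return ['offset' => $offset, 'limit' => $limit, 'next_index' => $next];
}
```

With 2500 rows and a limit of 1000, three runs cover offsets 0, 1000 and 2000, and the fourth run starts again at 0. This works because the table barely changes; for live items the offsets shift between runs, which is the complication described above.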

Aficionado

Re: Prevent duplicate ad posts
« Reply #170 on: March 12, 2017, 03:03:26 pm »
Quote from: dev101
Again, if we limit the scan to the latest items only, things become much simpler.

Well, duplicate ads usually come within a fixed period of time. Someone hires a few people to post the same (more or less) ad. So checking OLD ads (more than, say, a month old) could be useless.

So maybe "latest items" could mean items posted within a period of time, maybe a month or two?
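
Defining "latest" by age instead of by count could be sketched like this. The function name and the 'published' field (a Unix timestamp) are assumptions for illustration; in a real plugin the same cutoff would more likely go into the SQL WHERE clause.

```php
<?php
// Hypothetical helper: keep only items published within the last $days days.
// $items: list of ['id' => int, 'published' => int (Unix timestamp)].
// $now is passed in so the function stays easy to test; time() would be used in practice.
function items_within_window(array $items, int $days, int $now): array
{
    $cutoff = $now - $days * 86400; // 86400 seconds per day

    return array_values(array_filter($items, function (array $item) use ($cutoff): bool {
        return $item['published'] >= $cutoff;
    }));
}
```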


dev101

Re: Prevent duplicate ad posts
« Reply #171 on: March 12, 2017, 03:09:14 pm »
Well, not exactly useless, but less harmful, sure.

Say you have a user who tries to re-publish (post again) the same item over and over, every month or so. Given your site's rate of 100-150 new items per day, that repost will be missed if the limit is set to the latest 1000 only, since 1000 items cover only about 7 to 10 days. So, if you wish to prevent even those, another function (a low-priority one, mind you) could run via cron to scan items from particular users only and block either the old ones or the new ones, per your preference.

The key point is, of course, to stop new "immediate" duplicates, and this is relatively simple to implement: just limit the scan to the latest N items and you are safe. The other, low-priority function could be split to check users with IDs from, say, 1 to 1000, then the next batch, then the next, spreading the batches over a full 24-hour period.

That should be more acceptable, performance-wise.
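
One way to spread the per-user scan over 24 hours, as a rough sketch: an hourly cron run picks one slice of user IDs based on the hour. The function name and parameters are hypothetical, and it assumes user IDs run from 1 up to the highest ID, gaps included.

```php
<?php
// Hypothetical helper: which user IDs should the low-priority scan cover in this run?
// $run is the run number within the day, e.g. (int) date('G') for an hourly cron (0..23).
function user_batch_for_run(int $maxUserId, int $runsPerDay, int $run): ?array
{
    // Size each batch so that all IDs are covered once per day.
    $batchSize = (int) ceil($maxUserId / $runsPerDay);

    $from = $run * $batchSize + 1;
    if ($from > $maxUserId) {
        return null; // nothing left to scan in this run
    }
    $to = min($maxUserId, ($run + 1) * $batchSize);

    return [$from, $to]; // inclusive ID range
}
```

With 24000 users and 24 runs per day, each run checks 1000 users: 1 to 1000 at hour 0, 1001 to 2000 at hour 1, and so on.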
« Last Edit: March 12, 2017, 03:10:56 pm by dev101 »

dev101

Re: Prevent duplicate ad posts
« Reply #172 on: March 12, 2017, 03:20:06 pm »
One /off topic/ question: do you use noCaptcha reCaptcha? Also, do you block cloud services somehow? Maybe with CloudFlare's filtering? It's strange that you get so much spam. Is it mostly human-driven?

I have been using ZB Block (as per your old suggestion :) ) for years now. There are just 2 minor changes you need to make for PHP 7 compatibility, and with the latest forks of the rules on GitHub by Maikuolan (zbb-dirty-30, zbb-badip-fork, and CIDRAM, a newer IP filter), it works really great. With some scripting wizardry, setting up auto-updates from GitHub can put your mind completely at ease :)

Aficionado

Re: Prevent duplicate ad posts
« Reply #173 on: March 12, 2017, 03:32:59 pm »
Quote from: dev101
One /off topic/ question: do you use noCaptcha reCaptcha? Also, do you block cloud services somehow? Maybe with CloudFlare's filtering? It's strange that you get so much spam. Is it mostly human-driven?

I have been using ZB Block (as per your old suggestion :) ) for years now. There are just 2 minor changes you need to make for PHP 7 compatibility, and with the latest forks of the rules on GitHub by Maikuolan (zbb-dirty-30, zbb-badip-fork, and CIDRAM, a newer IP filter), it works really great. With some scripting wizardry, setting up auto-updates from GitHub can put your mind completely at ease :)

I don't use ZB Block any more for my Osclass and WordPress sites.

For WordPress I use other, better plugins.

For Osclass it blocked some human posters, so now I don't block anything. I prefer to check for bad words and duplicates.



Liath

  • issues
  • Hero Member
  • *
  • Posts: 1346
  • </html> the end is always near
Re: Prevent duplicate ad posts
« Reply #174 on: March 12, 2017, 06:26:08 pm »
Sorry...

I haven't read and understood it all. Is there a better solution to compare title/description?

I think letting the admin choose between MD5 (fast, many false results) and similarity (slow, fewer false results) would be the best way. I don't see another good solution.
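
To make the trade-off concrete, here is a small sketch of both modes behind one switch. The function names, the mode strings and the 85% threshold are assumptions for illustration, not the plugin's actual settings; md5() and similar_text() are PHP built-ins.

```php
<?php
// Hypothetical helpers, not the plugin's real code.
function normalize_text(string $text): string
{
    return strtolower(trim(preg_replace('/\s+/', ' ', $text)));
}

function is_duplicate(string $a, string $b, string $mode = 'md5', float $threshold = 85.0): bool
{
    $a = normalize_text($a);
    $b = normalize_text($b);

    if ($mode === 'md5') {
        // Fast: one hash per text, but any changed character defeats it.
        return md5($a) === md5($b);
    }

    // Slow: similar_text() compares the full strings, but it tolerates small edits.
    similar_text($a, $b, $percent);
    return $percent >= $threshold;
}
```

A title with one extra exclamation mark is not a duplicate in 'md5' mode but is one in similarity mode, which is exactly the difference the admin would be choosing between.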

SteveJohnson

  • Sr. Member
  • ****
  • Posts: 328
  • Golden tip - Clear your cache :|
Re: Prevent duplicate ad posts
« Reply #175 on: March 12, 2017, 06:37:19 pm »
How about a function that checks the entire DB for duplicates manually (irrespective of email addresses)?
Or maybe it can run at a particular time, say every day at 3:00 am (the timer could be set in the admin panel).
Or maybe as a queued job whose limit can be set by the admin?

Aficionado

Re: Prevent duplicate ad posts
« Reply #176 on: March 12, 2017, 06:45:43 pm »
Quote from: Liath
Sorry...

I haven't read and understood it all. Is there a better solution to compare title/description?

I think letting the admin choose between MD5 (fast, many false results) and similarity (slow, fewer false results) would be the best way. I don't see another good solution.

We are talking about checking for duplicates across different emails (accounts), and not in real time, but via a cleanup/cron job.

TangoX

  • Jr. Member
  • **
  • Posts: 64
Re: Prevent duplicate ad posts
« Reply #177 on: March 15, 2017, 08:18:37 pm »
@Liath Any news on a new version containing: 1, 2, 3, 4?

Thanks!

Liath

Re: Prevent duplicate ad posts
« Reply #178 on: March 20, 2017, 02:21:23 am »
but please be patient, i'm very busy at moment  :-\


sorry... no

Liath

Re: Prevent duplicate ad posts
« Reply #179 on: March 28, 2017, 02:42:15 am »