OCR for construction documents does not work, we fixed it

(getanchorgrid.com)

53 points | by wcisco17 2 hours ago

12 comments

  • sreekanth850 4 minutes ago
    We’re taking a different path: building a parsing engine that converts CAD (DWG/DXF) into fully structured JSON with preserved semantics (no ML in the critical path). We also have a separate GIS parser that extracts vector data (features, layers, geometries) independently. I'd like to know how you handle consistency and reproducibility across runs when using models, and how you make it affordable, especially at scale, because as far as I know CAD and GIS need precision and accuracy.
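
    For what it's worth, one way to check reproducibility across runs is to fingerprint a canonical serialization of the parsed output and compare hashes. This is only a sketch: the layer name, entity layout, and the `fingerprint` helper are made up for illustration and are not anyone's actual schema.

```python
import hashlib
import json


def fingerprint(parsed: dict) -> str:
    """Hash a canonical JSON form so key order cannot affect the result."""
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical runs over the same drawing: same content, different key order.
run_a = {
    "layer": "E-POWR",
    "entities": [{"type": "LINE", "start": [0.0, 0.0], "end": [12.5, 0.0]}],
}
run_b = {
    "entities": [{"end": [12.5, 0.0], "start": [0.0, 0.0], "type": "LINE"}],
    "layer": "E-POWR",
}

# A third hypothetical run where one coordinate drifted.
run_c = {
    "layer": "E-POWR",
    "entities": [{"type": "LINE", "start": [0.0, 0.0], "end": [12.6, 0.0]}],
}

same = fingerprint(run_a) == fingerprint(run_b)
drifted = fingerprint(run_a) != fingerprint(run_c)
```

    Any drift between runs, even in a single coordinate, changes the hash, so a mismatch is cheap to detect before you diff the full output.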
  • Terr_ 1 hour ago
    > OCR for construction documents does not work

    I'm reminded of the Xerox JBIG2 bug back in ~2013, where certain scan settings could silently replace numbers inside documents, and bad construction plans were one of the cases that led to it being discovered. [0]

    It wasn't overt OCR per se: end users weren't intending to convert pixels to characters or vice versa.

    [0] https://www.youtube.com/watch?v=c0O6UXrOZJo&t=6m03s

  • frogguy 43 minutes ago
    Looks cool! Where are you getting the data to finetune the CV models for element extraction? I'm worried there isn't a robust enough dataset to build a detection model that will generalize to all of the slightly different standards each discipline (and each firm, for that matter) uses.
  • testUser1228 1 hour ago
    What do you foresee being the end use case for this (or most valuable use case)?
    • wcisco17 1 hour ago
      Anyone building in or for construction tech — whether that's a startup building estimating or project management software, a construction company with an internal tech team solving this themselves, or a builder looking to automate their workflow. The common thread is drawings. Every one of those groups lives and dies by their ability to extract actionable data from a PDF that was never designed to be machine-readable. We're building the layer that makes that possible so they don't have to start from scratch.
      • wang_li 1 hour ago
        Why does the workflow lie at the level of a real or virtual piece of paper and not in the metadata from the applications used to create that piece of paper? Seems like a CAD tool would allow you to identify each element of the drawing, assigning metadata as required.
        • jsidney 53 minutes ago
          Only a small set of construction stakeholders participate in the CAD ecosystem (e.g., architects, large GCs) while a broader set of stakeholders (subcontractors, trades, smaller GCs/CMs) do not receive BIM files and work with PDFs. CAD/BIM is a wonderful aspiration but for many the reality is PDFs.
          • instig007 1 minute ago
            Re. "CAD/BIM", technically speaking CAD doesn't imply BIM, and the industry's promotion of BIM is akin to AI promotion among software engineering teams - the benefits aren't clear upon detailed review of the advertised capabilities. The CAD part, on the other hand, is generally recognized as essential tooling for the profession, and I'm surprised to hear that it is just a "wonderful aspiration".
        • cyanydeez 45 minutes ago
          Oh you sweet summer child. These drawings are anywhere from 0 to 120 years old and might be anything from a file pulled off a floppy disk from 1970 to scanned-in, coffee-stained pieces of paper that sat in a desk, folded a hundred times.

          The world in which metadata is commonly attached to any file doesn't exist, and probably never will, no matter how much you try to improve CAD workflows.

  • Iulioh 1 hour ago
    When will this be available for 30000x8000px electrical diagrams?

    I have to make a BOM and oh boy I hate my job

    • oritron 1 hour ago
      What software made the bitmap? Seems like a step earlier in the pipeline could help generate a BOM more easily.
      • Iulioh 1 hour ago
        I'm not really sure and I don't have access to it. I just receive flat PDFs or TIFFs.

        A lot of them are "archival" so I'm pretty OOL

    • alexeischiopu 1 hour ago
      I’m building a similar platform, with electrical being furthest ahead - SLD, panels, lights, power, comms.

      Also do doors, windows, and mechanical equipment.

      DM me and I can include you in the next preview.

      • testUser1228 2 minutes ago
        I'm not sure how to dm on here, but I'm very interested
      • Iulioh 1 hour ago
        I work in the automotive field. I don't know if this complicates things further, but I appreciate any help!
    • jsidney 1 hour ago
      What do you hate the most?
  • hspraggins77 1 hour ago
    Great points raised!
  • alexeischiopu 1 hour ago
    Good idea :)
  • vessenes 1 hour ago
    cool. What's pricing like?
  • achillesheels 1 hour ago
    Love it! Starbucks Venti Macchiato sip

    Love to give it to an arc client, not sure who the right person to implement this would be? Hmm…

  • fithisux 2 hours ago
    Of course it is not working. PDF and images are supposed to be tamper resistant. OCR tries to reverse engineer them.
    • kube-system 1 hour ago
      Since when is tamper resistance a part of PDF or any common image format?
      • pwagland 1 hour ago
        PDF files can be signed; that is tamper resistance. Tamper resistance doesn't have to make any difference to the readability of the document.
        • kube-system 1 hour ago
          So can any type of file -- that doesn't have any relevance to the supposed design of every file type in existence. Now, later versions of PDF do have explicit support for signatures, but what does this have to do with preventing OCR? OCR reads a file, it doesn't change the original file.
          • fithisux 56 minutes ago
            True but you can make modified copies if you reverse engineer it with OCR.
          • ranger_danger 1 hour ago
            Some OCR solutions do change the original file, like OCRmyPDF. They take layers that were just images before and replace them with text layers so that you can search the document.
            • kube-system 1 hour ago
              That isn't OCR, but an application of the resulting output of OCR. Again, a signature on a PDF or any type of file doesn't prevent you from reading it. (It also doesn't technically prevent you from changing it, it just enables the detection of changes to a particular file.)

              There's nothing about PDFs or image formats that prevents anyone from doing OCR. Construction documents are difficult to OCR because OCR models are not well trained for them, and because they're very technical documents where small details are significant. It doesn't have anything to do with the file format.
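
              A rough sketch of the "detects changes, doesn't prevent reading" point. It uses an HMAC from Python's standard library as a stand-in for the public-key signatures that PDF signing actually uses, and the key and the drawing note are made up:

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Return a hex tag over the data (HMAC-SHA256, a stand-in for a real signature)."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, signature: str) -> bool:
    """True only if the data still matches the tag produced at signing time."""
    return hmac.compare_digest(sign(data, key), signature)


key = b"signer-secret"            # hypothetical signing key
original = b"BEAM B-12: W12x26"   # hypothetical drawing note
signature = sign(original, key)

# Signing does nothing to stop anyone from reading the content.
readable = original.decode("ascii")

# Editing the bytes is also possible; it just no longer verifies.
tampered = original.replace(b"W12x26", b"W12x16")

original_ok = verify(original, key, signature)
tampered_ok = verify(tampered, key, signature)
```

              Here `original_ok` is true and `tampered_ok` is false, while `readable` is available either way. Someone could re-sign the tampered copy with a different key, but it would then fail verification against the original signer's key.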

        • ranger_danger 1 hour ago
          Can't one just remove the signature and re-sign it with anything else after tampering? Who verifies PDFs that hard?
          • kube-system 1 hour ago
            If you're performing OCR, you're, almost by definition, disregarding the source file. The whole point of OCR is to be transformative.
      • fithisux 57 minutes ago
        You can't change a PDF; it is designed not to be easy to OCR.
        • kube-system 7 minutes ago
          PDFs are merely a collection of objects that can be read plainly from the file. Some of those are straight-up plain text that doesn't even need to be OCR'd; it can simply be extracted. It is also possible to embed image objects in PDFs (this is common for scanned files), which might be what you are thinking of. But that is not a design feature of PDF; it is just the output format of a scanner: an image. Editing a PDF is a matter of editing a file, which you can do as you would with any other.
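
          A minimal sketch of that, assuming an uncompressed content stream with simple literal strings. The fragment below is hand-written for illustration and is not a complete PDF (no xref table or trailer); real files usually compress their streams and use font encodings, so you would want a proper PDF library there.

```python
import re

# Hand-written, deliberately tiny PDF fragment (hypothetical example data).
# Object 2 is a text content stream; object 3 is an image object dictionary.
SAMPLE_PDF = b"""%PDF-1.4
1 0 obj
<< /Type /Page /Contents 2 0 R >>
endobj
2 0 obj
<< /Length 50 >>
stream
BT /F1 12 Tf 72 720 Td (PANEL LP-1 120/208V) Tj ET
endstream
endobj
3 0 obj
<< /Type /XObject /Subtype /Image /Width 8 /Height 8 >>
endobj
"""


def extract_literal_text(pdf_bytes: bytes) -> list[str]:
    """Return literal strings drawn with the Tj operator.

    Only handles strings without nested parentheses or escapes.
    """
    matches = re.findall(rb"\(([^()\\]*)\)\s*Tj", pdf_bytes)
    return [m.decode("latin-1") for m in matches]


def count_image_objects(pdf_bytes: bytes) -> int:
    """Count dictionaries that declare themselves as image XObjects."""
    return len(re.findall(rb"/Subtype\s*/Image", pdf_bytes))


texts = extract_literal_text(SAMPLE_PDF)
images = count_image_objects(SAMPLE_PDF)
```

          The text comes straight out of the bytes with no OCR involved; only the image object would need OCR to recover whatever was scanned into it.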
  • ware-intel 52 minutes ago
    Your smart features look like a game changer! Nice job!