Lucky Boost

Boost Traffic & Conversions

Exporting Data from Webmaster Tools via API


Getting all your data out of Google Webmaster Tools is usually very easy. All you need to do is export your data as a CSV or open it in a Google Doc. That holds as long as you maintain your website properly, because then you won’t have much to export. I’m sorry to say that this is not the case with many websites. Especially if the owner only updates the site every now and then, the odds of finding a mess reported in Google Webmaster Tools skyrocket. This does not change much even if a “professional” redesigns the site.

So now we have access to a site’s profile on Webmaster Tools, and of course we find hundreds of errors. (As long as you are under 1,000 errors you won’t have any problem exporting the data.) To export a longer list you will need to use the Webmaster Tools API. A past coworker of mine, Mark Ginsberg, wrote a post about it on SEOmoz: Liberating Your Data from Google Webmaster Tools – a Step-by-Step Guide. I used this post many times myself to get information out of Webmaster Tools, but a few weeks ago when I ran the script I got an error trying to export the CRAWL_ERRORS, CONTENT_ERRORS, and SOCIAL_ACTIVITY files. I was mainly interested in the error reports, so I was disappointed when only TOP_PAGES and TOP_QUERIES were exported.

I started looking around for a fix for the script, but apart from more users asking for help with the same problem, I was unable to find an update that would get it running again.

So with no working script and thousands of errors I needed to find a different script.

I found a post by Karthikeyan Thangavel, Getting Crawl Errors Using Google Web Master Tools API, with a Java class based on an example provided by Google. I made a bunch of changes to the code, mainly so that it doesn’t print extra data and so that it saves the data to a CSV as well as printing it to the console.

For all the Java illiterates, I’m providing a step-by-step guide to getting your data out of Webmaster Tools.

To run Java you need a development environment. I went with Eclipse, the one I’m most familiar with. Getting Eclipse up and running can be a big pain; I hope I can get you through it. The guide is for Windows users only. (At the end of the instructions there is a slide show with some images; I hope it helps.)

  1. Download Eclipse – go for the classic version.
  2. Extract the folder and put it someplace on your computer (I put it in C) – no installation required.
  3. Create a shortcut on the desktop to the file eclipse.exe, the one with the purple round logo (be careful not to pick the eclipsec.exe file).
  4. Download the Java JDK – click on Download JDK; you will need to accept the license agreement before you can download.
  5. Download gdata – Google Data Java client library (only the first link) and guava – Google core libraries for Java (only the first jar file listed). Extract them and put both folders wherever suits you. (I put them in C)
  6. Run the JDK installation. At the end a web page will open asking you to register; you don’t need to register.
  7. Launching Eclipse can take some time. As long as a message box does not pop up saying you are missing something, you are OK. If a message does come up, I’m sorry to say I don’t know how to help you solve the problem, as it can be one of many different things. Try Google for help.
  8. You will be asked to select a workspace – just give it a location that is convenient for you. (slide #3)
  9. Wait until Eclipse opens.
  10. Start a new Java project. Go to File >> New >> Java Project (slide #5)
  11. Give it a name (I named it wmt-error) and click next. (slide #6)
  12. At the top of the screen you will see 4 tabs; we need the third tab, “Libraries”.
  13. On the right choose “Add External JARs…” (slide #7)
  14. Navigate to the location of gdata you downloaded previously.
  15. Go to the folder gdata/java/lib/ and choose all the files and press Open at the bottom. (slide #8)
  16. Repeat step 15 for guava library (only 2 files and they are not in a lib folder). (slide #9)
  17. Finally click on Finish. (If you don’t see a screen divided into 3 with your project name on the top right, just close the welcome screen.)
  18. Now you need to create a class. Go to File >> New >> Class (slide #10)
  19. Name the Class “get_error_wmt” (if you name it something else you will get an error when trying to run the code) and click Finish. (slide #11)
  20. Delete code on the screen and paste the following code.
  21. To make it easier to edit the code we will add row numbers. Select from the top bar: Window >> Preferences >> General >> Editors >> Text Editors, and check the “Show line numbers” option. (slide #13)

 

 

Now that we have row numbers

  1. Go to row number 24 and enter the address of the site you want to get errors for (it must be exactly the same URL that appears in Webmaster Tools and must end with a slash).
  2. On rows 83 and 84 enter the email address you use to log in to Google and your password.
  3. On row 59 you can change the name and path of the file that will be saved.
  4. Press the Run button (the arrow in the green circle) on the top bar. This will run the project.

Go fix all those errors!

If you did something wrong you will see red text in the console area. If you can’t figure out the problem you are invited to post a comment and I’ll try to help.

BTW, if you run the code for different URLs it won’t overwrite the content of the file; it will just add the new results below the content that is already there, under a title of Crawl Issues for site: “site url”.
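
To illustrate that append behavior, here is a minimal sketch. It is not the code from this post: the class name, the method names, the file name wmt-errors.csv, and the column order (crawl type, issue type, URL, detail) are all assumptions made for the example. The point is the second argument to FileWriter, which opens the file in append mode so earlier runs stay intact.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper, not the class from this post.
final class AppendCsvSketch {

    /** Wraps a field in quotes and doubles any embedded quotes, per the usual CSV convention. */
    static String csv(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the fields of one crawl issue into a single CSV line (no line terminator). */
    static String formatRow(String[] fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(csv(fields[i]));
        }
        return line.toString();
    }

    /** Appends a title line plus one line per issue to the file at the given path. */
    static void appendIssues(String path, String siteUrl, String[][] rows) throws IOException {
        // 'true' means append: existing content in the file is kept.
        try (PrintWriter out = new PrintWriter(new FileWriter(path, true))) {
            out.println("Crawl Issues for site: " + siteUrl);
            for (String[] row : rows) {
                out.println(formatRow(row));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample values only; assumed column order: crawl type, issue type, URL, detail.
        String[][] rows = {
            {"WEB_CRAWL", "NOT_FOUND", "http://www.website.com/filename", "404 (Not found)"}
        };
        appendIssues("wmt-errors.csv", "http://www.website.com/", rows);
    }
}
```

Running main twice leaves two titled blocks in the file, one below the other. If nothing shows up at all, check that the path is somewhere your user is allowed to write.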

This code was written to export only the errors, but if you’d like to export other information all you need to do is change the feed. The feed being used is one of the feeds Google provides for Webmaster Tools.

The feed can be found on row 29. If you look at row 39 you will see that we append “/crawlissues/” to the feed address. This is unnecessary for the other feeds, so don’t forget to remove “/crawlissues/” and the “+” sign that follows it.
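
For anyone wondering what the finished feed address looks like, here is a rough sketch of how a per-site address can be put together. It assumes the site ID in the address is simply the URL-encoded site address, which is what the feed IDs printed to the console suggest. The class name, the method name, and the idea of passing the feed name as a parameter are mine, not part of the original script, and the sketch only covers feeds that follow the feeds/SITE_ID/NAME/ pattern.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original script.
final class FeedUrlSketch {

    // Base address as it appears in the feed IDs printed to the console.
    static final String FEEDS_BASE = "https://www.google.com/webmasters/tools/feeds/";

    /**
     * Builds a per-site feed address.
     * Assumption: the site ID is the URL-encoded site address.
     */
    static String siteFeedUrl(String siteUrl, String feedName) {
        // The site address must end with a slash, exactly as in Webmaster Tools.
        if (!siteUrl.endsWith("/")) {
            throw new IllegalArgumentException("Site URL must end with a slash: " + siteUrl);
        }
        try {
            String siteId = URLEncoder.encode(siteUrl, "UTF-8");
            return FEEDS_BASE + siteId + "/" + feedName + "/";
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(siteFeedUrl("http://www.website.com/", "crawlissues"));
        // prints https://www.google.com/webmasters/tools/feeds/http%3A%2F%2Fwww.website.com%2F/crawlissues/
    }
}
```

The early check on the trailing slash is there because a site address that doesn’t match the one in Webmaster Tools is an easy mistake to make, and it is cheaper to catch it before any request goes out.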

Good luck with exporting your data!

24 Comments

  1. Work like a charm!

  2. Hi, ran into this error:

    Unable to retrieve the feed. Server unavailable.
    com.google.gdata.util.ResourceNotFoundException: Not Found
    Site not found

    at com.google.gdata.client.http.HttpGDataRequest.handleErrorResponse(HttpGDataRequest.java:599)
    at com.google.gdata.client.http.GoogleGDataRequest.handleErrorResponse(GoogleGDataRequest.java:564)
    at com.google.gdata.client.http.HttpGDataRequest.checkResponse(HttpGDataRequest.java:560)
    at com.google.gdata.client.http.HttpGDataRequest.execute(HttpGDataRequest.java:538)
    at com.google.gdata.client.http.GoogleGDataRequest.execute(GoogleGDataRequest.java:536)
    at com.google.gdata.client.Service.getFeed(Service.java:1135)
    at com.google.gdata.client.Service.getFeed(Service.java:998)
    at com.google.gdata.client.GoogleService.getFeed(GoogleService.java:645)
    at com.google.gdata.client.Service.getFeed(Service.java:1017)
    at get_error_wmt.main(get_error_wmt.java:100)

  3. Hi Menachem, the cited blog post from Mark references my PHP script “GWTdata”, which used to process data from the WMT web interface (since most data was not available through official APIs). That script stopped working due to changes Google made to the web interface. Anyway, I released a follow-up project to retrieve the crawl errors as a CSV file again. So, for those interested in a PHP solution, you can head over to the project site here: https://github.com/eyecatchup/GWT_CrawlErrors-php

    Cheers

  4. I get the following error
    Unable to retrieve the feed. Server unavailable.
    com.google.gdata.util.ParseException: [Line 1, Column 16033, element wt:issue-type] Invalid value for attribute : ‘null’

    This occurs while trying to pull Soft404 issue type as the issue type field for these are blank.
    Is there a way to fix this?

  5. this works great. thank you :)
    Is there a way to input a date range?

  6. thanks, so for what period is the current report running for.

  7. Thanks for this post. I ran the code and the console ran without any red errors.

    I’m getting lines that say:

    Crawl Type: MOBILE_CHTML_CRAWL
    Issue Type: RESTRICTED_ROBOTS_TXT
    Url: mysite.com/cookiesdisabled.aspx
    Detail: URL restricted by robots.txt

    And I don’t see the excel file in the location I specified.

    Any ideas?

  8. Thanks for taking your time explaining all this process, any idea how the site id is for the feed ?

    https ://www.google.com/webmasters/tools/feeds/SITE_ID/keywords/

  9. Hi, this is a great post – have exported the errors. Perfect!

    Is there a way that I can export all the links from GWT? I have 22,000 and am looking to see if I can export all the links in this way?

    Cheers!

  10. Hi Menachem,
    I’ve followed your steps, I got the following errors:
    http://screencast.com/t/975fOgjQn3t

    When installed, I couldn’t follow step 16:
    16.Repeat step 15 for guava library (only 2 files and they are not in a lib folder). (slide #9)

    There were no libraries available for load in the respective unzipped folder.

    What should I do?
    Thank you in advance,
    Arthur

    • I see some changes were made to the guava library, there is no folder to download only 2 files.
      All you need to do now is download the 2 files from http://code.google.com/p/guava-libraries/ under the row reading “You can download a JAR at:” and add them to your project.

      good luck exporting your errors

      • hi Menachem,
        Thank you again for your effort. Now it worked like a charm!
        May I humbly suggest you some small modifications in the tutorial to become easier to understand?

        5. should become:
        Download gdata – Google Data Java client library (only the first link). Extract and put both folders where it’s good for you. (I put it in C)

        6. (this will be a new point 6, all the rest of the points should be incremented by 1):
        Download guava – Google core libraries for java (both jar file listed as links). Put them as they are (do not extract them) where it’s good for you. (I put them both in C as well)

        16. should become:
        Repeat step 15 for guava jar libraries (both jar files you downloaded at point 6. should be imported in the same manner)

        21. (this number should be attributed to the text: “Now that we have row numbers” and I suggest the following modifications:
        1. Go to row number 24 -> should become -> 1. Go to row number 23
        2. On rows 83 and 84 -> should become -> 2. On rows 82 and 83
        3. On row 59 -> should become -> 3. On row 58
        I would add a point 4. (tested by me, I successfully retrieved 225,064 pages): On row 80, you could increase the MAX_PAGES value from 100 up to 10000 (that means that 10000*100 = 1,000,000 pages retrieved).

        I think the rest is pretty straight forward.
        Thank you again :)
        Arthur

  11. I made it all the way to the last step. I can access the API, pull the data, and see it streaming in through the console.

    However, my program terminates abruptly at the “saveCrawlIssueEntry” stage – this information appears as “red” in the outline view. I’m not seeing any errors appear in the console.

    It sounds like the exact same problem that Elie describes in the comments above – the code executes and data streams in, but the data doesn’t save. I also notice that the data in the console provides details on the GWT errors, but I’m not seeing the crawl error source coming in – here’s an example entry:

    Id: https: //www.google.com/webmasters/tools/feeds/http%3A%2F%2Fwww.website.com%2F/crawlissues/10000
    Crawl Type: WEB_CRAWL
    Issue Type: NOT_FOUND
    Url: http: //www.website.com/filename
    Detail: 404 (Not found)

    (Website name changed).

    Anyone else seeing this or a work around?

  12. Charlotte,
    Make sure the program is allowed (has sufficient rights) to write to C:\, where the files should be created. You may want to right-click Eclipse, run it as Administrator, and see if you have the same problem.
    Arthur

  13. Thanks, Arthur. I gave it a try and also tried changing the download location but the script still terminated partway through execution.

    I’ve also tried the PHP method and the script fails to execute because my computer runs out of memory (even if I increase the size of the php.ini file). Perhaps the Java version is failing for the same reason, memory limitations.

    (Also, your notes above on the changes w/r/t the gdata files were also helpful!)

  14. As it turns out, the program terminated due to size issues. The program was trying to download errors for a site with a very large number of URLs blocked by the robots.txt file, which caused it to time out.

    Charlotte
